No Excess Babbage - Design Considerations for the
Interface to a Systolic Matrix Processor
T. Shaw
B.Sc., B.E. (Hons)
A thesis submitted to the Department of Electrical and
Electronic Engineering, The University of Adelaide, to meet the













1.3 Standard Systolic/Wavefront Arrays
1.3.1 Systolic and Wavefront SIMD Array Processors
1.3.2 Multi-bit SIMD Processing Arrays
1.3.3 Configurable Processing Arrays
1.4 Conclusion
2 The System
2.1 The University of Adelaide Systolic Processing Array
2.1.1 The Basic Architecture
2.1.2 The Outer Product
2.1.3 Interfacing to the Processing Array
2.1.4 Processing Element Redundancy
2.2 Evolution of the System
2.3 Conclusion
3 Algorithms
3.1 Memory Subsystem: An Overview
3.2 Matrix Multiplication
3.2.1 Partitioning
3.3 Solution of Sets of Linear Equations
3.4 Gauss-Jordan Elimination
3.4.1 Block Version
3.4.2 Determining the Inverse of the Pivot Block
3.4.3 Implementation on the Proposed Memory Architecture
3.4.4 Inverting Rather Than Solving a Set of Matrix Equations
3.4.5 Solving Sets of Equations That Don't Fit Into the Cache
3.4.6 Iteratively Improving a Set of Solutions
3.5 The Discrete Fourier Transform
3.5.1 Implementation








































4.1 Signed-Digit, or Carry-Propagation Free, Arithmetic
4.1.1 Why SD Arithmetic
4.1.2 Sign Magnitude Specifics
4.1.3 Conversion To and From Signed Digit and Two's Complement
4.1.4 Signed Digit Implementation
4.1.5 VLSI Layout and Implementation
4.2 Byte Divider







5.0.2 Parallelizing and Expanding the Difference Engine
5.0.3 Example Mappings on the Difference Engine




5.1.4 Complete Decrement Units
5.1.5 Multiplexer Selection
5.2 Implementation




6.2 Applying and Extracting the Outer Product
6.2.1 Extracting the Outer Product
6.3 Memory Overview





6.4.4 Latency After a Miss
6.4.5 Write Policy
6.4.6 Implementation






6.6.2 Read/Write Path
6.6.3 The Complete Bus Exchanger




















































7.2 Gauss-Jordan Elimination and Inversion
7.2.1 Gaussian Elimination for Matrices Smaller than the Cache Size
7.2.2 Gaussian Elimination for Matrices Larger than the Cache Size
7.2.3 Conclusion
7.3 Discrete Fourier Transform
7.3.1 Discrete Fourier Transform Without Multi-level Memory System
7.4 The Kalman Filter
7.4.1 Small Problem Size Running From SRAM
7.4.2 Large Problem Size Running From SRAM
7.4.3 Kalman Filter With a Multi-level Memory Sub-system
7.5 Conclusion
8 Future Projects
8.1 A Multi-Processor Teraflop Engine
8.1.1 The Hypercube
8.1.2 Fibre-Optical Interconnection
8.1.3 Utilising the Hypercube - the Proposed Model
8.1.4 Algorithms
8.2 Wavelet Processor
8.2.1 Still Picture Compression
8.2.2 Moving Picture Compression
8.3 Conclusion
9 Summary and Conclusion
Appendices
A SCAP Data Sheets
B Proof of Convergence for Iterative Matrix Inversion
B.1 Initialising the Pivot Inverse Iteration
C Broadcasting Data on a Hypercube
C.1 Simple Broadcast
C.2 Pipelined Broadcast
C.3 Parallel or Rotated Broadcast




































1.1 Basic Systolic Array
1.2 Timing Ring Diagram for a) Systolic Array b) Wavefront Array
1.3 Linear Systolic Array for Convolution
1.4 Array for the Direct Solution of a Set of Linear Equations
1.5 MPP Architecture
1.6 DAP Architecture
1.7 Saxpy-1 Block Structure
1.8 Processing Element Structure
1.9 Warp Architecture
2.1 Processing Element Structure
2.2 Nibble (Digit) Skewing Input Data
2.3 a) Inner Product b) Outer Product
2.4 Block Diagram of Data Controller
2.5 Block Diagram of System
2.6 Redundant Array a) Before Failure b) After Failure
2.7 Redundancy Using Data Controller
3.1 Block Diagram of System
3.2 Four Regions of a Partitioned Matrix Product
3.3 Memory Access for Matrix Product
3.4 Row Oriented Gauss-Jordan Elimination
3.5 Extracting a) Standard b) Transpose
3.6 Column Oriented Gauss-Jordan Elimination
3.7 Dividing Sets of Equations Into Blocks





































Converting Positive 2's Complement to Signed Digit
Converting Negative 2's Complement to Signed Digit
Converting Signed Digit to 2's Complement
a) b)
c) d)




Karnaugh Maps for First Digit Cell of an Adder
a) b)
c) d)
Karnaugh Maps for Last Digit Cell of an Adder with Single Input
Unlatched Signed-Digit Adder Cell
HSpice Plots for Unlatched Adder Cell







































Latched Signed-Digit Adder Cell
HSpice Simulation for Latched Adder Cell
Latched Adder Cell With Multiplexer
Layout of Unlatched Beginning Cell
HSpice Simulation Results for Unlatched Beginning Cell
Layout of Latched Beginning Cell
Layout of Latch & Multiplexed Input Beginning Cell
a) Unlatched End Cell With Standard Propagated Outputs
b) Unlatched End Cell With Sign Magnitude Propagated Outputs
a) Unlatched End Cell With Standard Propagated Outputs Simulation Results
b) Unlatched End Cell With Sign Magnitude Propagated Outputs Simulation
Results
a) Latched End Cell With Standard Propagated Outputs
b) Latched End Cell With Sign Magnitude Propagated Outputs
a) Latched End Cell With Standard Propagated Outputs Simulation Results
b) Latched End Cell With Sign Magnitude Propagated Outputs Simulation
Results
Error graphs for a) and b) sub-ranges of Q
Error graph over full range of Q
Difference in Errors for Two Schemes
Wallace-tree Multiplier
Signed Digit Multiplier
VLSI Layout of Digit Multiply Cell





Processing Cell Using Pipelined Multiplier
VLSI Layout of Pipelined 8 x 8 Multiplier
HSpice Simulation Plots a) Input b) Output
Lattice Representation of a Matrix Product
2D Addr Gen.
4D Addr Gen.
Sign Detect Unit Cell
Sign Detect Unit
Domino Logic Exclusive-OR Gate
Flow Diagram for Decrement Cell
Decrement Cell
Full Decrement Cell
Four Input Zero Detection Cell
Four-Bit Decrement Unit
Simulation Results of Four Bit Decrement Cell
Sixteen-Bit Decrement Unit
HSpice Simulation Results for Sixteen Bit Decrementer
Delta Register Multiplexers
VLSI Layout of an Eight-bit Address Generator
a) D0 to D3 b) D4 to D7
c) Q0 to Q3 d) Q4 to Q7 & Neg
Simulation of Normal Address Generation
a) D0 to D3 b) D4 to D7
c) Q0 to Q3 d) Q4 to Q7 and Neg Out
Simulation of Transposed Address Generation































































































5.19 c) Q0 to Q3 d) Q4 to Q7 and Neg Out
5.19 Simulation of Prime Factored Address Generation




Data Controller / Cache Memory pair
Internal Structure of 1M x 4 Static Column DRAM
Block Diagram of Four-Way Interleaved DRAM Subsystem
Initial Timing of Burst Read
Synchronous DRAM Internal Block Diagram
Memory Bank Configuration using a) Thin & b) Wide DRAMs
Initiating a Read Simultaneously to Completing a Write
a) Latch Cell b) Driver Cell
VLSI Layout of the Driver/Transmission Cell
Four-to-One Multiplexer/Demultiplexer
VLSI Section of Four-to-One Mux/DeMux
Bus Exchanger VLSI Layout
Latched Read With Constant Select Signals
Latched Read With Varying Select Signals
Latched Write With Varying Select Signals
Transparent Read With Varying Select Signals
Transparent Write With Varying Select Signals
Performance Figures for Simple Multiplication Model (MFlops)
Performance for a) Array Dimension = 64 b) Load/Store Time = 5ns (MFlops)
Performance for Array Dimension = 64 and Load/Store Time = 5ns (MFlops)
a) Block Size = 4 b) Block Size = 16
c) Block Size = 64 d) Block Size = 512
Performance Estimates for Matrix Multiplication Without Streaming
a) Block Size = 4 b) Block Size = 16
c) Block Size = 64 d) Block Size = 512
Performance Estimates for Matrix Multiplication With Streaming
a) Block Size = 4 b) Block Size = 16
c) Block Size = 64 d) Block Size = 512
Performance for Gauss-Jordan Elimination on Matrices that Fit Into Cache
a) Block Size = 4 b) Block Size = 16
c) Block Size = 64 d) Block Size = 512
Performance for Gauss-Jordan Inversion on Matrices that Fit Into Cache
a) Block Size = 4 b) Block Size = 16
c) Block Size = 64 d) Block Size = 512
Performance for Gauss-Jordan Inversion on Matrices that Do Not Fit Into Cache
Performance of Two Dimension DFT - FLOPs vs DFT length
Performance of Partitioned Two Dimension DFT - FLOPs vs DFT length
Performance of Partitioned Two Dimension DFT - FLOPs vs DFT length
Performance of Three Dimension DFT - FLOPs vs DFT length
Performance of Four Dimensional DFT - FLOPs vs DFT length
10ns Cycle Time a) Array Size = 32 b) Array Size = 64
20ns Cycle Time c) Array Size = 32 d) Array Size = 64
Simple Kalman Filter for State Vector Smaller Than Array Dimension
10ns Cycle Time a) Array Size = 32 b) Array Size = 64
20ns Cycle Time c) Array Size = 32 d) Array Size = 64
Simple Kalman Filter for State Vector Larger Than Array Dimension
a) Block Size = 4 b) Block Size = 16














































































































7.16 Lower Bound on Kalman Filter With Multilevel Memory
Node Labelling for a 3-Cube
a) Extended Simple Memory Structure b) X-Connect Memory Structure
Processing/transmission Diagram for 1st Partitioning
a) First Multiplication Phase
b) Second Multiplication Phase After First 'Roll'
a) 100ns Internode Delay b) 1000ns Internode Delay
c) 10000ns Internode Delay
Performance Estimates for Multiprocessor Multiplication
Distributed Row-wise Gauss-Jordan Elimination
a) Latency = 100ns b) Latency = 1000ns
c) Latency = 10000ns
Performance Estimates for Multiprocessor Gauss-Jordan Elimination
Four Coefficient Compact Wavelet Transformation Matrix
Pyramidical Transform and Permute Procedure
Cell from a Linear Array for the Wavelet Transform
Linear Array for n-dimensional Wavelet Transform
Single Cell for Wavelet Implementation
9.1 Complete System Block Diagram
C.1 Spanning Tree for a 4 Dimensional Hypercube
C.2 4 Rotated Spanning Trees for 4-Dimensional Hypercube








Contemporary Systolic Arrays in SIMD Configuration
Contemporary Systolic Arrays in MIMD Configuration
Comparison of Vector Computers in the 1980's
Comparison of Vector Computers in the 1990's
Comparison of Massively Parallel Computers
Ordering when Number of State Vectors is Small
Ordering when Number of State Vectors is Large
Table of SD representation
Table of SM representation
Digit Outputs for Unlatched Multiplier, b = 1
Digit Outputs for Unlatched Multiplier, b = 64
Digit Outputs for Unlatched Multiplier, b = 141
Digit Outputs for Unlatched Multiplier, b = 255
2 Dimensional Mappings on Difference Engine
4 Dimensional Mappings on Difference Engine
Decrement Cell Combinations
a) Sout & Mout b) Neg-Out
Resultant Addresses for Normal Addressing Data
Result of Simulation of Addressing for Transposed Access
















































2 Dimensional DFT Performance
Partitioned 2 Dimensional DFT Performance
Partitioned 2 Dimensional DFT Performance
Three Dimensional DFT Performance









Parallel Multiplication Allocation Table





Computational systems used for the solution of large matrix-based numerical problems are
rapidly converging on the limits of technology, and novel architectures are now being sought to
improve performance. In signal processing and control theory, high performance systems are
required which do not carry the physical size penalty of current supercomputers. To achieve
performance comparable with current supercomputers, a systolic processing array specifically
targeted at matrix applications has been developed at the University of Adelaide.
The work of this thesis involves the problem of delivering and receiving the data moving
between the processing array and the memory subsystem. This involves the reformatting of
existing algorithms to map efficiently onto the matrix array, the design and VLSI layout of a
matrix address generation unit using signed digit arithmetic for enhanced performance, and the
block level description of a multiport cached memory system. Performance estimates predict a
modest configuration will perform selected matrix routines in excess of three GigaFLOPs.
Statement of Originality
This work contains no material which has been accepted for the award of any other degree or
diploma in any university or other tertiary institution and, to the best of my knowledge and
belief, contains no material previously published or written by another person except where due
reference has been made in the text.
I give consent to this copy of my thesis, when deposited in the University Library, being




I would especially like to thank my supervisor Mr Michael Liebelt for his direction, advice and
support throughout my work.
The work undertaken for the degree is based around a project at the University of Adelaide
which has been largely carried out by Dr Warren Marwood. Dr Marwood has provided a great
deal of knowledge, enthusiasm and encouragement to me over the past two years, and I extend
my grateful thanks to him as well.
The advice of Dr C.C. Lim and Dr N. Burgess has been of great benefit, and led me into
areas of great interest.
Thanks also go to the group of other postgraduate students with whom I have worked, for
their comments and thoughts. Particularly to David Standingford, Richard Beare, Ali Moini,
Andrew Blanksby and the HiPCAT team.
Finally, I would like to thank my parents, for the support they have always given me in




This thesis describes some design considerations made during a project at the University of
Adelaide involving the design of a Systolic Processor and Interface. The processor is a specialized
architecture intended for use as a generic matrix processing engine, comprising a large number
of processing elements configured in a two-dimensional array, and some support hardware. The
processor elements themselves are kept very simple, with more complex operations taking place
in the support hardware. On its own, the processing array performs the simple operation of a
matrix outer-product, which can be used to perform other algorithms by clever addressing and
application of data to the array.
The work presented in this thesis is the culmination of several design 'tasks'. Briefly, they
are:
1. the development and redesign of various matrix algorithms to map efficiently to the defined
matrix array structure. This process indicated what hardware features would significantly
enhance or reduce the system performance.
2. the VLSI design of the address generation hardware, which is the critical path in the
processing-array/memory interface. To maximize the performance of this critical compo-
nent, non-conventional arithmetic was used to implement an expanded Marwood Difference
Engine.
3. the specification of a memory subsystem. The organisation of the memory system and
its inter-relationships with the algorithms significantly affects the overall system perfor-
mance. Therefore, a memory system that can support the matrix processing array at
nearly peak speed was designed to extract maximum performance while still retaining
relatively moderate cost.
Additionally, a multiprocessor architecture with a matrix processor as each node is proposed
and its feasibility investigated.
The thesis is organised around the work in such a way that each distinct task has its own
chapter. However, as the tasks are heavily inter-related, there is some overlap among the
chapters. The tasks occupy Chapters 3 to 8. Chapter 2 introduces the architecture of the
system design up to the time of the start of this project, while conclusions of this project are




Originally proposed by H.T. Kung and C.E. Leiserson in 1978 and described in the landmark text
"Introduction to VLSI Systems" [66], the term 'systolic' refers to the systolic cycle of the heart.
The systole is the contraction of the heart and arteries to expel blood and pass it throughout
the body. Similarly, a systolic computer system is one in which data is 'pumped' through the
systolic hardware or array, and is operated on by several sections of the hardware in turn. A
basic system of this form is shown in Figure 1.1. Data moves from the memory and is passed
to each processing cell in turn until it is finally returned to memory. As can be seen from the
diagram, data is utilised by each cell (in this case six times) for the cost of only one memory
load and one memory store operation. Obviously, for a memory bandwidth limited system, the
speed-up of a linear systolic system is equal to the number of cells in the array.
Figure 1.1: Basic Systolic Array
Systolic arrays have been constructed in linear, rectangular, triangular and hexagonal con-
figurations, to name a few, to solve various specific problems, such as Gaussian analysis, LU
decomposition, radar adaptive beam-forming and data formatting.
The properties of a systolic array are that the array is constructed of cellular processing units,
that can theoretically be repeated indefinitely in each dimension. The cells contain only local,
homogeneous interconnections, and operate such that the array exhibits linear pipelinability, ie
that the array can be pipelined into as many stages as there are processing cells.
1.2 Wavefront Arrays
S.Y. Kung [49] distinguishes between systolic arrays and wavefront arrays by the single feature
of a global clock. The global clock is required for the pumping action of data through a pipeline,
as each cell passes data on to the next at cycle intervals. However, if there is no global clock, and
the data operations are locally timed, data flow techniques become applicable, and the array
takes on a wavefront approach, ie data is processed in waves that propagate through the array
using local synchronization. Kung reproduces diagrams from a report by Q.E. Dolecek entitled
"Parallel processing systems for VHSIC" that show the difference [49, 22]. These are shown in
Figure 1.2a) for a systolic processor and Figure 1.2b) for a wavefront processor.
Kung goes on to compare the wavefront and systolic structures and draws the conclusion






















Figure 1.2: Timing Ring Diagram for a) Systolic Array b) Wavefront Array
Note that a wavefront array that uses a local clock synchronised to an adjacent processor
could be considered to be either a wavefront or a systolic array, depending on the interpretation
of the clock signal. For example, the local clock could be synchronised with and derived from
an input clock, and an output clock could be the local clock delayed by one cycle. Thus, the
output clock could be considered a data-flow timing parameter, as it is derived locally and there
is no global propagation parameter. However, as the output clock is derived from the input
clock which is ultimately applied to all the edge cells of the array, the output clock could be
considered to be a global clock. It is simply a matter of interpretation into which class some
processing arrays fall.
1.3 Standard Systolic/Wavefront Arrays
Initially, systolic arrays implemented structures designed to solve specific algorithms. Dedicated
to their task, they were of little use for anything but the task they were designed to achieve.
Some examples are shown below.
1. Convolution
Given two sequences of numbers u_j and w_j for j = 0, 1, ..., N - 1, then the linear
convolution of the two sequences is defined by [49]
y_k = \sum_{j=0}^{N-1} u_j w_{k-j}    (1.1)
A linear systolic array that implements this algorithm is shown in Figure 1.3.
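Since the cells of Figure 1.3 are not detailed here, the following C++ sketch emulates one possible cell arrangement in software: the current input sample is broadcast to every cell while partial sums move one cell per cycle towards the output. It illustrates the systolic principle under assumed cell behaviour, not the particular hardware of the figure; all names are illustrative.

#include <vector>

// Software analogue of a K-cell linear systolic convolver.  The input
// sample is broadcast to every cell each cycle while partial sums march
// one cell per cycle towards the output; cell j holds weight w[K-1-j],
// so the value emerging from the last cell at cycle t is
// y[t] = sum_m u[t-m] * w[m], the linear convolution of Equation 1.1.
std::vector<double> systolicConvolve(const std::vector<double>& u,
                                     const std::vector<double>& w) {
    const std::size_t K = w.size();
    std::vector<double> s(K, 0.0);   // partial sums resident in the cells
    std::vector<double> y;           // samples emitted by the last cell
    for (std::size_t t = 0; t + 1 < u.size() + K; ++t) {
        double in = (t < u.size()) ? u[t] : 0.0;  // zero-pad to flush
        for (std::size_t j = K - 1; j > 0; --j)   // sums shift right
            s[j] = s[j - 1] + in * w[K - 1 - j];
        s[0] = in * w[K - 1];                     // first cell restarts
        y.push_back(s[K - 1]);
    }
    return y;
}

Note that each input sample is used K times for one memory load, which is the source of the speed-up claimed above for memory bandwidth limited systems.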
2. Direct Solution of the Linear System Ax = b
Using the Gauss-Jordan elimination scheme, the direct solution of a set of linear equations
can be formed. Figure 1.4 shows an array that implements this algorithm [76]. Data
inputs are propagated through the array together with a 'control' word that determines
the operation to be undertaken at the processing cell. The cells can become quite complex
because of the control word and the various operations that are available, although the
general interconnection scheme is only suitable for a very limited set of algorithms.


















































































Figure 1.4: Array for the Direct Solution of a Set of Linear Equations (from [76])
1.3.1 Systolic and Wavefront SIMD Array Processors
The early systolic and wavefront machines were often modelled on the Single-Instruction-
Multiple-Data (SIMD) model [49]. In these architectures, there is local interconnection between
the processors, and between a processor and its local memory. Each instruction is globally
broadcast by a host controller, and a processing cell has the choice of either executing the
instruction or masking the instruction. Thus, two processing elements can not be executing
different instructions during the same instruction cycle.
Initially, many of the early systolic SIMD arrays operated in bit-serial form, and were there-
fore referred to as Binary Array Processors (BAPs) [49]. The bit-serial architecture provides for
flexible data formatting, and is efficient in terms of the memory resources and chip intercon-
nections, as a large number of processing elements fabricated on a chip will require minimal
I/O interconnections. However, as available chip resources have increased with time and the
memory bandwidth is now often the bottleneck of a system, there has been a decline in bit-serial
processing architectures. Example BAPs include the Massively Parallel Processor (MPP), from
Goodyear Aerospace, the Distributed Array Processor (DAP), from International Computers
Limited, and the Geometric-Arithmetic Parallel Processor (GAPP), from NCR [38, 49, 24].
The Massively Parallel Processor
Commissioned by NASA for the processing of satellite images, and running considerably over
budget in the process, the MPP comprises 16,384 bit-serial processing elements, configured as a
128 x 128 array. An additional four rows of processing elements are included for redundancy in
the event of failure, so the actual array is in fact 128 x 132, although the user only sees the square
128 x 128 array. The PEs in the MPP can accept inputs from one of several nearest neighbours
and allow overlapping data I/O and array operations. The PEs are custom designs, with eight
PEs on a chip. Each chip has access to fast local RAM, although the memory interface still
represents the performance bottleneck due to the speed of the custom circuits. A block diagram
of the MPP is shown in Figure 1.5, from [80].
Figure 1.5: MPP Architecture (from [80])
The processing elements have no built-in fixed or floating point arithmetic, only a full adder.














require complex arithmetic to run efficiently. The MPP provides working performance of up to
seven billion additions per second [80].
Distributed Array Processor
The DAP is constructed of blocks of 16 PEs, in various two dimensional sizes, such as 32 x
32, 64 x 64, etc. A DAP PE combines 4096 bits of RAM with a bit-serial adder, and local
interconnections to the nearest four neighbours. The local interconnections combine to form a
grid, along which arbitrary data movement is possible (within the confines of the SIMD array
architecture). Therefore, data shifting becomes very easy. A diagram adapted from Hockney
and Jesshope [37] is shown in Figure 1.6.
Figure 1.6: DAP Architecture (from [37])
Geometric Arithmetic Parallel Processor
The Geometric Arithmetic Parallel Processor (GAPP) is aimed at low precision image pro-
cessing, and as such uses very simple processing primitives. The GAPP processor cascades an
arbitrary number of chips, each containing an array of 6 x 12 processing elements (PEs). A com-
mercial array typically will contain approximately 2304 PEs, configured as an array of 48 x 48
PEs. Each PE contains an ALU, 128 bits of RAM, and local interconnection to each of the PE's
four nearest neighbours. Each instruction is broadcast to the complete array, in the standard
SIMD form. The ALUs are single-bit processing cells.
The GAPP structure is very similar to that of the DAP, although there is much less memory
made available to each PE (128 bits compared to 4096 bits). This results in a very high chip
density, and so a large number of chips can be packed into a small area. However, as the
GAPP PEs were designed for use in image processing and low precision applications, the PE's
resources would rapidly be exhausted in the event that a numerically intensive algorithm, such
as boundary integral evaluation, were attempted. The problem could still be feasibly run on a
















1.3.2 Multi-bit SIMD Processing Arrays
The Cellular Array Processor
The Cellular Array Processor (CAP) [49] was developed by Fujitsu, and includes a multibit ALU
that can perform block floating point operations on the applied data. Each PE is more complex
than those of the DAP and GAPP and only twenty will fit on a chip, even though the CAP
was implemented using a 1.25 µm process instead of the 3 µm process used for the GAPP. Of
the twenty PEs, only sixteen are used at any one time. The remaining four are used to provide
fault-tolerance in the system, so that up to 25% of the PEs can fail on a chip and the operation
of the system will show no apparent change. Each PE incorporates 4000 bits of memory, and
more can be added externally if required.
Not only are the arithmetic units of the CAP processor more complex than those of the
DAP and GAPP systems, the control is more complex as well. The CAP processing elements
can control three independent functions of the PE, which are the arithmetic/logic function,
the memory/common bus function and the I/O function. Thus, the CAP is aimed at more
computationally- and I/O-sophisticated problems that exhibit a large degree of parallelism,
such as parallel searches and Kalman filtering.
The Saxpy-1
The Saxpy-1 is a vector architecture that uses systolic principles to obtain fast inter-processor
communication. It was developed by Saxpy Computer Corp in the mid to late eighties. The
system is composed of five major segments, shown in Figure 1.7. The peak performance is
estimated at approximately 1000 MFLOPs.
Figure 1.7: Saxpy-1 Block Structure
The five segments are:
• The System Controller. This is a general purpose computer, such as a DEC VAX, running
VMS. The System Controller is used to compile and link the application program, and
coordinate the allocation of resources.
• Matrix Processor. This is a linear array of up to 32 pipelined, floating-point proces-
sors that have systolic and global interconnections. Although global interconnections are








implementation justify this departure [77]. A diagram of the Matrix Processor segment
and the arrangement of the 32 processing elements is shown in Figure 1.8.
Figure 1.8: Processing Element Structure
• System Memory, which stores all the data arrays for use by the matrix processor.
• Mass Storage System, an I/O interface that provides access to high speed data-storage
peripherals.
• Saxpy Interconnect, a bus containing both data and control that links the other four units
of the system.
The only part of the Saxpy-1 that uses systolic techniques is the Matrix Processor, which
operates as a SIMD systolic machine. None of the processing elements possess independent
program code, and program control ultimately resides with the host computer or workstation.
1.3.3 Configurable Processing Arrays
With the very large range of systolic algorithms that have been developed, attempts have been
made to design systolic systems that can be reconfigured to optimally suit several algorithms.
Examples of these include the 'Configurable Highly Parallel Computer' (CHiP) [87] and the
'Programmable Systolic Chip' (PSC) (from Marwood [58], [26]), although probably the best
known architecture of this class is the (i)Warp machine, which was designed and developed by
Carnegie-Mellon University and various industry partners [49].
The Warp system is composed of an array of ten or more programmable processing cells,
each capable of 10 MFLOPS performance, together with a host MC68020-based machine and
a host/array interface unit. A block diagram of the Warp is shown in Figure 1.9. Data flows
through the array on paths labelled X & Y, with a third path allocated to address and control
signals.
Each processing cell contains its own program store and sequencer, a 5 MFLOP multiplier,
a 5 MFLOP ALU, a 64 Kword memory, communication interface and a register file. The cells
can operate on the data with IF-THEN-ELSE and DO-WHILE structures, and must each be
programmed separately. More complex operations such as divide and square root are left to
specialised 'Boundary Processors' (BP), which can be attached to one end of the array. A 10-
cell Warp machine is capable of computing a 1024-point complex FFT every 0.6 ms, and a 2-D
cosine transform on a 256 x 256 image in 13 ms.
The Warp system was followed by the iWarp architecture, a partnership between Carnegie-























Figure 1.9: Warp Architecture (from [49])
consuming some 600,000 transistors for a computational output of 20 MFLOPS per processor,
together with up to 64 Mbytes of memory. It can be seen, therefore, that the transistor count
becomes very high for systems of any significant number of cells if the cells themselves become
complex. Indeed, the increase in complexity is contrary to one of the initial aims of systolic
processing as defined by H.T. Kung, namely the construction of arrays of small, simple cells that
can be repeated across a VLSI structure.
1.4 Conclusion
The one point that becomes apparent from a study of the existing architectures is a simple
conundrum involving the complexity of the cells (processing elements):
1. If the processing elements are very simple, the versatility of the processing array is greatly
restricted, and so the potential market and range of applications is limited.
2. If the processing elements are too complex, then systems involving large numbers of pro-
cessing elements become unwieldy, and the advantage of systolic processing is lost.
This led the designers of the processing element structure to the conclusion of keeping the
processing element structure simple, while increasing the versatility of the array by using an
intelligent interface between the array and the data storage to perform the more complex data




2.1 The University of Adelaide Systolic Processing Array
Over the past few years, the Department of Electrical & Electronic Engineering and the Centre
for Gallium Arsenide VLSI Technology, at the University of Adelaide, have been engaged in a
project to implement a systolic processor for matrix computation. The array can perform the
basic matrix operations of addition, subtraction, multiplication and Hadamard multiplication,
with more complex operations provided in firmware in the array controllers [54, 7, 62, 59, 63, 55,
56, 65, 57, 5, 18]. In this way, Marwood et al. attempt to produce the generality desired to make
a systolic architecture useful for a large range of applications, while maintaining the complexity
at a level simple enough to minimise overheads and contain the cost of the large number of
computational units necessary. By using novel address generation hardware, Marwood has
created the generality in address mappings that is required for various systolic arrays, while the
integration of computational units with more complex address generation/boundary interface
data controller units ensures that the simplicity of the repeated cells is retained.
2.1.1 The Basic Architecture
The systolic processor developed by the University of Adelaide is composed of a two dimen-
sional square array of simple processing elements. Each processing element (P.E.) is capable
of performing the operations multiply, addition & multiply-accumulate, and contains a simple
multiplier, accumulator, output register and control structure. The structure of a P.E. is as
shown in Figure 2.1.
Before the data is supplied to the multiplier cell, a control word of four bits is applied to the
input of the processing element. This control word defines the cell operation for the duration of
the next operation, and includes information such as:
• which operation to apply to the operands
• whether to reset the accumulators (perform multiply), or to leave the data in the accumu-
lators (multiply-accumulate)
• whether to unload the accumulators, and, if so, in which direction to move the output
data.
The operands supplied to the multiplier cell are broken into one-nibble (four-bit) digits,
and supplied one at a time from the 'A in' & 'B in' inputs, which progressively builds the full
product from the successive partial products. The input data is also passed on to the next cell






Figure 2.1: Processing Element Structure
data is latched before being retransmitted, thus ensuring a fixed propagation delay through a
cell. As the data delayed by this amount is one nibble in size, the delay is referred to as a 'nibble
propagation delay', or nibble delay.
As data is coming from two directions, the propagation delay in one dimension must be
compensated for when data is applied in the other dimension. For example, if all cells on an
array boundary were to receive data at the same time, then a processing element located at
co-ordinate (1,4) (first row, fourth column) will receive its row data immediately and its column
data after three nibble delays¹. Therefore, the row data must be delayed by three nibble delays
for proper coordination to occur. Of course, this applies equally for the processing element (4,1),
in which the column data must be delayed by three nibble delays, and similarly the row data
controller must delay applying data to the second column by one nibble delay and to the third
column by two nibble delays, etc. These delays result in a skewing of the input wavefronts, by
one nibble delay period, and so the inputs are said to be 'nibble skewed'. The nibble skewing is
shown in Figure 2.2. Note that the loading of adjacent words in a column for the vertical
dimension or in a row for the horizontal dimension is initiated before the completion of the
loading of the previous word.
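The schedule can be summarised arithmetically: if digit d of the word destined for edge position k (row or column) is applied at cycle k + d, both operands reach P.E. (i, j) in step. The following minimal C++ sketch prints this schedule; the array size and digit count are arbitrary examples, not the dimensions of the actual design.

#include <cstdio>

// Nibble-skew schedule: digit d of the operand for edge position k is
// applied at cycle k + d, so the wavefront entering the array is skewed
// by one nibble delay per row or column.  Because consecutive positions
// start only one cycle apart while a word takes 'digits' cycles, the
// loading of adjacent words overlaps, as noted in the text.
int main() {
    const int p = 4;       // edge positions (rows or columns), example
    const int digits = 3;  // nibbles per operand word, example
    for (int k = 0; k < p; ++k)
        for (int d = 0; d < digits; ++d)
            std::printf("edge %d, digit %d -> cycle %d\n", k, d, k + d);
    return 0;
}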
By nibble skewing, the operation of the array can be approximated to be that of the cal-
culation of an outer product. A matrix product is then a series of outer products that are
accumulated within the array, and then unloaded as a batch at the end of computations. This
approach requires no long bus drivers (and the inherent delay in such devices) that are needed
for broadcast strategies, and hence can be scaled to an arbitrary size limited only by the speed
data can be fetched from memory.
In practice, the 'arbitrary' size of the processing array is bounded on each side by practical
considerations. If the processing array is too small, then the start-up overhead will be significant,
and no speed improvement over a scalar processor is obtained (consider the limiting case of
n → 1 for an n x n array). The maximum size of the processing array is limited mainly by
algorithm considerations. While a large array is very efficient if it is operating on data that
'fills' the array, the efficiency drops considerably if the problem dimension is significantly smaller
than the processing array dimension (eg. a 10 x 10 matrix product on a 20 x 20 processing array
only uses one quarter of the available processing elements). Simulations of algorithms have shown







u"., on, bn. fna fns lno
::l::::Ï::i::::;:
a0n ----- ao2 aol aoo
atn ----- 412 ail 410
a2n ----- 422 412 a2O
a3n ----- a32 413 âgO "+
"4n ----- ^42 ^14 a40
a5n -----"52 "15 
uso
a6n ----- 462 al6 a6O
Figure 2.2: Nibble (Digit) Skewing Input Data
that a processing array of between 20 x 20 and 100 x 100 processing elements is practically
feasible [58].
2.1.2 The Outer Product
Contrary to most matrix systolic arrays, which perform matrix multiplication in inner-product
form, our matrix processor calculates the answer in outer-product form. There are two reasons
for this:
• Memory access patterns are simplified. This is a simple observation of the access patterns






































































Figure 2.3: a) Inner Product b) Outer Product
Consider the inner product configuration in Figure 2.3a). If it is assumed that the matrix
is stored in standard form in memory (ie a complete row or column is stored linearly in
memory, followed by the next row or column) and the systolic array is of size P x Q, then
the access pattern to fetch a_00, then a_01 & a_10, followed by a_02, a_11 & a_20, as shown in the
figure, will be
0, (P, P+1), (2P, 2P+1, 2P+2), ...
ie. an out of order access.
For the outer product array in Figure 2.3b), the same storage arrangement will result in
the access pattern
0, 1, 2, ..., P-1, P, P+1, ...
which is an in order access. This results in fewer memory bank conflicts and a higher 'hit'
rate for the caches.
• Operations on smaller matrices are more evenly balanced. This relates to the previously
mentioned advantage, but is concerned more with the delay slots which must be inserted
into the inner-product configuration.
One of the motivations behind using systolic arrays is that the computational cells, or
processing elements (P.E.'s), can be made simple and be replicated many times on a
single chip. Therefore, it is not desirable to attempt to make the cells run at varying
speeds.
However, if the cells all run at the same speed all the time, only 1/P of the available
bandwidth is utilised at start-up, then 2/P, etc., until the full bandwidth is utilized
P accesses after the initial access. Although this may not be significant if the matrix
being operated on is much larger than the array dimension P, the effect on matrices that
are a similar order to the array is to double the time a computation takes. In Gaussian
elimination (discussed later) operations on matrices of similar order to the processing array
are commonplace. As the outer product configuration uses all the processing elements all
the time, the full computational rate can be achieved.
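The contrast between the two fetch orders can be made concrete with a short sketch. The row-major storage and the matrix dimension below are assumptions for illustration and do not reproduce the exact storage convention of Figure 2.3; the point is only the shape of the two address sequences.

#include <cstdio>

// Fetch orders for a row-major n x n matrix (address of a[i][j] is
// i*n + j).  The inner-product array consumes anti-diagonal wavefronts,
// which jump between widely separated addresses, while the outer-
// product array consumes whole rows and columns, giving the sequential
// pattern 0, 1, 2, ...  The dimension n is an arbitrary example.
int main() {
    const int n = 4;
    std::printf("inner product (anti-diagonal wavefronts):");
    for (int k = 0; k < 2 * n - 1; ++k) {        // wavefront number
        for (int i = 0; i < n; ++i) {
            int j = k - i;                        // element on wavefront
            if (j >= 0 && j < n) std::printf(" %d", i * n + j);
        }
        std::printf(" |");
    }
    std::printf("\nouter product (sequential):");
    for (int a = 0; a < n * n; ++a) std::printf(" %d", a);
    std::printf("\n");
    return 0;
}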
2.1.3 Interfacing to the Processing Array
A complete system based around an array of processing elements will include a memory system
for storing all the required operands. Therefore, a memory interface is required that will
provide the link between the array and the memory subsystem. The hardware that is used
for the generation of the correct addresses, the receiving of the desired data, and other sundry
functions can be incorporated into one unit, termed the 'Data Controllers'. The functions these
provide include:
• Address Generation
The addresses of the next memory access must be calculated according to some formula.
In this architecture, a difference engine is used, which calculates addresses according to
the formula
next addr. = (prev addr. + offset) mod max addr.
(a small illustrative sketch follows this list). The address generator is dealt with in more
detail in Chapter 5.
• Memory Interface
The Data Controllers must supply the address calculated by the Address Generators to
the correct memory unit, and then read the result back. In the event that a multi-level
memory is implemented, the Data Controllers must also check cache memory tags before
reading data to ensure that the correct data is read into the controllers. A proposed









Figure 2.4: Block Diagram of Data Controller
• Convert and Supply Data to Array
The data that is read from memory is supplied in 'word' form from the memory, and must
be converted to nibble form and the rate of supply slowed down before it is passed to
the array. This is done with a series of shift registers with parallel load and serial unload
facilities.
• Support Arithmetic
As the processing elements only perform the most basic of arithmetic operations (multiply
and addition), any extra operations that are required in an algorithm must be implemented
within the Data Controllers, either before the operands are applied to the array (such as
row normalisation) or after they are extracted (such as trace determination).
A block diagram of a Data Controller is shown in Figure 2.4, while a diagram of the complete
system is shown in Figure 2.5.
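As promised above, here is a minimal C++ sketch of the difference-engine address formula. The choice of modulus is an assumption for this example: taking offset = n and max addr = n^2 - 1 for a row-major n x n matrix yields the well-known transposed-order walk with a single adder, which is one reason a difference engine is attractive for matrix addressing.

#include <cstdio>

// One-dimensional difference engine stepping through a row-major n x n
// matrix: next = (prev + offset) mod max_addr.  With offset = n and
// max_addr = n*n - 1 the generator visits 0, n, 2n, ..., 1, n+1, ...,
// i.e. the matrix in transposed order; the final element n*n - 1 is a
// fixed point of the recurrence and is fetched as-is.
int main() {
    const unsigned n = 4, maxAddr = n * n - 1;
    unsigned addr = 0;
    for (unsigned i = 0; i < maxAddr; ++i) {
        std::printf("%u ", addr);        // element a[addr / n][addr % n]
        addr = (addr + n) % maxAddr;     // difference-engine update
    }
    std::printf("%u\n", n * n - 1);      // the fixed last element
    return 0;
}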
2.1.4 Processing Element Redundancy
As the typical size of a processing array will vary approximately between 20 x 20 and 100 x 100
processing elements, the total number of P.E.'s in a system will vary between approximately
400 and 10,000. With numbers this high, consideration needs to be given to the reliability of
the processing elements, and hence of the complete system. Marwood and others [62, 75] have
noted the importance of including redundancy in such large systems, and techniques have
been devised that effectively reduce the defect rate to close to zero. These include concepts
such as configuring the processing array with an extra row or column of processing elements and
switching the data flow 'down' or 'across' a row or column in the event of processing element
failure. Figure 2.6a) shows a 5 x 4 processing array operating correctly, while Figure 2.6b)
shows the same processing array reconfigured after the failure of processing element number 11.
Using this approach, two extra multiplexers are required for each processing element for each
dimension of redundancy, for a total of 4p^2 multiplexers if the array is of dimension p x p and


















Figure 2.5: Block Diagram of SYstem




















Similarly, failure among the processing elements can be dealt with by the data controllers, by
switching in a complete new row or column fed by the controllers. This is illustrated in Figure 2.7.
Although the versatility of this approach is reduced compared to placing the switching with the












Figure 2.7: Redundancy Using Data Controller
2.2 Evolution of the System
The matrix engine designed at the University of Adelaide is very similar to one designed and
implemented by Marwood and others at the Defence Science & Technology Organisation at
Salisbury, South Australia, called the Systolic Configurable Array Processor (SCAP) [18, 54, 55,
56, 58, 59, 61, 62, 63, 75]. Although the projects were initiated by Marwood at approximately the
same time, several SCAP systems have already been constructed. The SCAP processing elements
were constructed using CMOS technology, whereas the University of Adelaide design was in
Gallium Arsenide, partly to aid development of that technology. The data sheets describing the
SCAP architecture are included in Appendix A. The important comparison points from the
data sheet of the SCAP A17502 Data Controller are reproduced below.
• Convenient interface between conventional memory architectures and a Processor Array.
• Cascadable to service arbitrary sized processor arrays.
• Bus interface allows common or independent memory subsystems.
• Flexible matrix addressing modes are built in to provide zero cost matrix operations
such as Transposition, Negation, Conjugation and Mapping for prime factor and other
algorithms.
• Direct support for both real and complex matrices, submatrices and non-square matrices.









• Programmable modes of operation. The same device can be used to fetch operands or to
store results.
• Executes its own instruction stream from host memory, allowing complicated algorithms
to be performed without host intervention.
• Each Data Controller can fetch operands or store results at a rate of 6.25M floating point
words per second.
These are the items that will be used for direct comparison with the system described in this
thesis.
A range of systolic processors that are not as closely related to our system as SCAP was
presented by Johnson, Hurson & Shirazi [41]. Their tables are reproduced in part in Tables 2.1
& 2.2. Some of these are briefly described in Chapter 1.
Table 2.1: Contemporary Systolic Arrays in SIMD Configuration
For comparison purposes, the matrix engine can also be compared to high end 'add-on' array
processors or, alternatively, with vector processors. In 1987, Dongarra produced a comparison of
the characteristics of vector supercomputers in the 1980's, which was repeated in 1992 by Kahan
in his examination of supercomputing in Japan [43]. These were used by Marwood [58]
as benchmarks against which he could compare the principles of systolic processing. The tables
detailing the comparisons are reproduced below, as Table 2.3 and Table 2.4 respectively.
2.3 Conclusion
By performing an outer product on the array instead of an inner product, memory patterns are
simplified and start-up delays are reduced. The simple processing elements can be replicated a
large number of times on a single chip and many chips included on a single MultiChip-Module
(MCM) to produce a large factor of parallelism while maintaining the simplicity of the simple
architecture. More complex algorithms will be implemented by performing more complex tasks
on an interface 'Data Controller' chip.
Instruction decoding occurs once per chip;






[Table entries from Johnson, Hurson & Shirazi [41] include: the Geometric Arithmetic
Parallel Processor (NCR; 48 x 48 array, 2304 cells; 16-bit fixed point maths; commercial);
the Associative String Processor (Brunel University; associative concepts, MIMD at high
level, SIMD at low level; research); the Configurable Highly Parallel Computer (Purdue
University; programmable cells embedded in a switch lattice for programmable topology;
research); iWarp (CMU; 8 x 8 array; Warp cell minus on-chip memory, expandable to
1024 cells; 32-bit f.p. multiplication; prototype); and machines from the University of
Texas at Austin (32-bit f.p. capability; broadcast and Saxpy), IRISA, Campus de Beaulieu
(very small VLSI footprint; 100s of cells), and the Norwegian Defense Res. Estab. Other
entries offer block processing with f.p. maths (image oriented) and bit-serial cellular I/O
with 32-bit f.p. multipliers in each cell (320 MFLOPs).]

































































































































Table 2.4: Comparison of Vector Computers in the 1990's
(a) with four vector units
Table 2.5: Comparison of Massively Parallel Computers
The matrix engine is aimed at a middle ground of competing technologies. As such, it should
be faster than similar systems, such as those in Tables 2.1 & 2.2. It should also compete on speed
with the previous generation of vector machines (in Table 2.3), although the physical size and
power consumption will be lower. It is these machines that many users are currently using, and
making the matrix engine have similar performance will enable these users to get the same level
of performance on the desktop, with lower initial and running costs. The latest vector machines,
such as those in Table 2.4, can be faster than the matrix engine, but will be larger, more
expensive, and consume more power. It is felt that a speed in excess of one GigaFLOP (1 x 10^9
floating point operations per second) is required on a matrix engine with an array size of between
800 and 1600 elements to fulfill these conditions. This figure is the sustained or equivalent
'required' floating point performance, ie the total number of required floating point operations
using a very efficient (the most efficient) algorithm divided by the total time, including start-up.
Note that the performance figures quoted in Table 2.4 are the peak performances of the listed
supercomputers, and that peak performance on these types of machine is typically three to
ten times higher than the sustained or equivalent required performance [93].
The work undertaken for this thesis was to investigate the ability of algorithms to be modified
to this simplified array, and to design the interface and memory specifications. This included
an analysis of the algorithms to include any 'special' or 'sundry' functions that must be incor-
porated into the interface 'data controller' chips, and also the design specification of a memory
subsystem that would efficiently support the algorithms. Additionally, the feasibility of incorpo-
rating several complete processing arrays, or 'matrix engines', into a multiprocessor architecture
was investigated.
To this end, the remaining chapters have been divided largely according to these task re-






































In this chapter, four algorithms and their implementation on the matrix processing array will
be presented. The algorithms are matrix multiplication, Gauss-Jordan elimination, the discrete
Fourier transform and Kalman filtering, covering fields including numerical computation, signal
processing and control systems. The expected performance of the algorithms when implemented
on the matrix array with memory interface is presented in Chapter 7. An overview of the memory
system, enough to explain the algorithm analysis, is presented here, while the memory subsystem
is described in more detail in Chapter 6.
3.1 Memory Subsystem: An Overview
Most computations are of a form in which two operands are fetched from memory and a solution
is stored to memory for each operation. Therefore, three ports to memory are ideally available
if the matrix processing array (the computational engine) is to maintain its high computation
rate. Unfortunately, several algorithms that will be implemented on the matrix engine have
addressing patterns that cause conflicts when implemented on standard high performance three-
port memories. Therefore, a bank-switched four-port memory was devised that allows two loads
and one store to memory simultaneously. The bank-switching is achieved using a custom 4-1
bidirectional multiplexer, with caches attached to the memory banks that switch between the
memory ports of the matrix engine. Additionally, to cope with data that must be recirculated
from the output back to the inputs very quickly, two dual-ported static RAM chips are included,
one for each of the output to input paths. A block diagram of the memory system is shown in
Figure 3.1.
As shown in the figure, each memory bank has a cache system attached to it, on the memory
side of the bank switches, unlike many other systems which attach the cache to the data port.
This means that cache data will move with the main memory bank, easing the constraints caused
by cache flushing and coherency considerations.
3.2 Matrix Multiplication
The most basic of operations to be run on the matrix engine is matrix multiplication. As
mentioned in Chapter 2, the matrix array is configured to perform an outer product. Thus each
of the data controllers applies a vector to the edge of the array, and the array produces the outer
product of the two vectors. The outer product can thus be defined as























Figure 3.1: Block Diagram of System
v_a \circ v_b = v_a v_b^T,  ie  (v_a \circ v_b)_{ij} = (v_a)_i (v_b)_j,  i, j = 1, ..., r    (3.1)
where r is the dimension of the vectors v.
The matrix product in terms of outer products is the in-place sum of the vector outer
products
C = A B = \sum_{j=1}^{n} (b_{j1} a_j, b_{j2} a_j, ..., b_{jn} a_j)    (3.2)
where a_j is the j-th column of A and b_j is the j-th row of B.
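In software terms, Equation 3.2 is the familiar triple loop with the summation index outermost, so that each pass performs one rank-1 (outer product) update. The C++ sketch below assumes square row-major matrices for brevity; it mirrors the order in which the array accumulates, one outer product per pass, with C resident in the accumulators.

#include <vector>

// Matrix product C = A*B (all n x n, row-major) computed as the
// in-place sum of n rank-1 updates, as in Equation 3.2: on step j the
// outer product of column j of A with row j of B is accumulated into C.
void outerProductMultiply(int n, const std::vector<double>& A,
                          const std::vector<double>& B,
                          std::vector<double>& C) {
    C.assign(static_cast<std::size_t>(n) * n, 0.0);
    for (int j = 0; j < n; ++j)            // one outer product per step
        for (int r = 0; r < n; ++r)        // row index into column a_j
            for (int c = 0; c < n; ++c)    // column index into row b_j
                C[r * n + c] += A[r * n + j] * B[j * n + c];
}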
3.2.1 Partitioning
While Equation 3.2 is appropriate to describe the product of two matrices that are of a
smaller size than the processing array of the matrix engine, in practice the matrix will generally
be much larger than the processing array. Indeed, if the matrix were smaller, it may be faster
to process such a small problem on a general processor, rather than going to the expense of
configuring the matrix array. What is needed is a way of partitioning the matrix multiplication
into segments that can be implemented on a fixed-size processing array that is smaller than the
matrix multiplication problem.
The partitioning is relatively simple, and can be derived by dividing the vectors a_j in Equa-
tion 3.2 into several vectors with at most p non-zero elements, for a processing array size of













and each is processed in turn. Of course, the zeros are only included for theoretical explanation,
and in practice, only the non-zero elements are used. One way to consider the partitioning is to












where the colon (:) is used to indicate a complete matrix dimension, in this case of a submatrix.
Therefore, the notation A_{1:} indicates rows 1, ..., p and columns 1, ..., n of an m x n matrix A,
A_{2:} indicates rows p+1, ..., 2p and columns 1, ..., n, etc., and is referred to as the row matrix
A_i. Similarly, the notation B_{:1} indicates rows 1, ..., m and columns 1, ..., p of an m x n matrix
B, B_{:2} indicates rows 1, ..., m and columns p+1, ..., 2p, etc., and is referred to as the column
matrix B_j.
Then each block of the matrix C can be calculated as:
C_{ij} = \sum_t a_i^t b_j^t    (3.5)
where a_i^t is the t-th column of the row matrix A_i, b_j^t is the t-th row of the column matrix B_j,
and C_{ij} is the (i, j)-th partition of the matrix C.
In effect, the partitioning can be viewed as a means of producing p^2 inner-products simulta-
neously.
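A minimal sketch of Equation 3.5 in code, assuming for brevity that n is an exact multiple of p (ie only region 1 of the partitioning below exists):

#include <vector>

// Block-partitioned product (Equation 3.5) for an n x n row-major
// matrix with n an exact multiple of the array size p: block C_ij
// accumulates the products of block (i,t) of A with block (t,j) of B,
// each p x p block product itself being p accumulated outer products.
void blockMultiply(int n, int p, const std::vector<double>& A,
                   const std::vector<double>& B, std::vector<double>& C) {
    C.assign(static_cast<std::size_t>(n) * n, 0.0);
    for (int bi = 0; bi < n; bi += p)              // block row i
        for (int bj = 0; bj < n; bj += p)          // block column j
            for (int bt = 0; bt < n; bt += p)      // summation index t
                for (int t = 0; t < p; ++t)        // outer products
                    for (int r = 0; r < p; ++r)
                        for (int c = 0; c < p; ++c)
                            C[(bi + r) * n + (bj + c)] +=
                                A[(bi + r) * n + (bt + t)] *
                                B[(bt + t) * n + (bj + c)];
}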
Dividing the matrix in the general case without padding produces four regions, as defined in
Figure 3.2.
The subscripts denote the coordinates of the block being calculated, while the superscript
defines the region in which the block falls. This notation follows that defined in [65, 58].
A two dimensional address generator can extract each of the blocks one at a time. The
overhead involved in accessing each partition independently is large, and an automated man-
ner of determining each block in turn is necessary. A four-dimensional address generator will
automatically calculate the partitions in region 1, a three-dimensional generator is needed for
regions 2 and 3, while a two-dimensional generator is required for region 4. Together with
the choice of four 'base-addresses', the partitions shown in Figure 3.2 can be automatically
generated without any external intervention. A C++ class to implement this is found in [58].
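The class of [58] is not reproduced here, but the idea can be conveyed with a minimal sketch of a multi-dimensional difference engine: each dimension holds a count and an offset, and when an inner dimension wraps, its contribution is unwound and the next dimension steps once. The names, interface and example strides below are illustrative, not those of [58].

#include <cstdio>

// Minimal multi-dimensional difference engine.  Dimension d contributes
// 'count[d]' steps of 'offset[d]'; when an inner dimension wraps, its
// contribution is unwound and the next dimension advances.  With four
// dimensions (column-in-block, row-in-block, block-column, block-row)
// this enumerates the region-1 partitions block by block.
struct DiffEngine {
    int dims;
    int count[4], offset[4];         // per-dimension extent and stride
    int index[4] = {0, 0, 0, 0};
    long addr = 0;
    long next() {                    // return current address, advance
        long a = addr;
        for (int d = 0; d < dims; ++d) {
            addr += offset[d];
            if (++index[d] < count[d]) return a;   // no carry: done
            index[d] = 0;                          // wrap: unwind...
            addr -= static_cast<long>(count[d]) * offset[d];
        }                                          // ...and carry on
        return a;
    }
};

int main() {
    // Example: 2 x 2 blocks of a row-major 4 x 4 matrix (n = 4, p = 2).
    // Prints 0 1 4 5 | 2 3 6 7 | 8 9 12 13 | 10 11 14 15.
    DiffEngine g{4, {2, 2, 2, 2}, {1, 4, 2, 8}};
    for (int i = 0; i < 16; ++i) std::printf("%ld ", g.next());
    std::printf("\n");
    return 0;
}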
A consideration to be made is one concerning the fetching of data from main memory into





























Figure 3.3: Memory Access for Matrix Product
simultaneously to memory. The caches improve the average memory access speed significantly
due to the large amount of cache reuse¹. However, if one input controller is fetching data from
main memory after a cache 'miss', the other must stall and wait if it has a cache 'hit'. Therefore,
assuming that the complete matrices fit into the cache (up to 1000 x 1000 for a 1 MWord cache),
there is a factor of approximately two introduced into the miss ratio and hence an increased
total miss penalty.
Using the matrix product in Figure 3.3, in which two n x n matrices are multiplied together
on a p x p array and N = \lceil n/p \rceil, the origin of the factor of two can be seen as follows².
If the first block of the solution to be determined is block C_11, then blocks (A_11, A_12, ..., A_1N)
and (B_11, B_21, ..., B_N1) need to be fetched from memory into the cache. The next logical block
to calculate would be either block C_12 or C_21. However, whichever of these is chosen, either
the controller accessing matrix A or the controller accessing matrix B respectively will have a
cache hit while the other has a cache miss. This results in a memory access time mismatch
and the data controller that had the hit would have to stall. Therefore, the standard (logi-
cal) access pattern of calculating the solution blocks in rows or columns will result in cache
misses along two of the four edges, ie misses when calculating blocks C_11, C_12, ..., C_1N, and
C_21, ..., C_N1. The remaining blocks will all achieve cache hits. This results in 2N - 1 misses
and (N - 1)^2 = N^2 - 2N + 1 hits, for a hit/miss ratio of
H = (N^2 - 2N + 1) / (2N - 1) \approx N/2 for large N.
If the pattern of calculating blocks is changed to calculating the diagonal blocks first, both
data controllers will have cache hits and misses at the same times. As the misses all occur when
the diagonal blocks are being fetched, there will be N misses and N(N - 1) = N^2 - N hits,
for a hit/miss ratio of (N^2 - N) / N = N - 1. As 1/N → 0 as N gets large, it is obvious that the
hit/miss ratio for calculating the diagonals first is approximately twice that for calculating by
rows or columns.
¹For an n x n matrix on a p x p array, the data is used \lceil n/p \rceil times.
²This of course extends to a more general case of two arbitrary matrices being multiplied together. For
convenience, the cache access pattern is considered, without loss of generality, for the given size.
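The two block schedules can be written out explicitly; a C++ sketch with a hypothetical N follows. The comment summarises the miss counts derived above.

#include <cstdio>

// Two block schedules for C = A*B partitioned into N x N blocks.  In
// row-major order a one-sided miss occurs at every block along two
// edges (2N - 1 misses); visiting the diagonal blocks first makes both
// data controllers miss together (N paired misses), roughly doubling
// the hit/miss ratio (N - 1 versus about N/2 for large N).
int main() {
    const int N = 4;                       // arbitrary example
    std::printf("row-major:      ");
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) std::printf("C(%d,%d) ", i, j);
    std::printf("\ndiagonal-first: ");
    for (int d = 0; d < N; ++d) std::printf("C(%d,%d) ", d, d);
    for (int i = 0; i < N; ++i)            // then the off-diagonal blocks
        for (int j = 0; j < N; ++j)
            if (i != j) std::printf("C(%d,%d) ", i, j);
    std::printf("\n");
    return 0;
}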
3.3 Solution of Sets of Linear Equations
With a speed in excess of one gigaflop (10⁹ floating point operations per second) for matrix operations, a method of solving large systems of equations on the matrix engine would be of considerable benefit. Of the several possible algorithms available, it is Gauss-Jordan which seems to hold the most promise, although others (eg Gauss elimination, LU decomposition) can easily be implemented with only relatively minor modifications.
The set of equations is denoted as

Ax = b   (3.6)

where A is an n × n matrix, x is an n × 1 vector of unknowns and b is an n × 1 coefficient vector. There can, of course, be several 'right-hand-side vectors' b1, b2, ..., br, in which case the vectors x and b become n × r matrices, and can be solved simultaneously.
3.4 Gauss-Jordan Elimination
In engineering and mathematics, the solution to a very large number of equations is often sought. Common methods of solving such a system of equations include Gauss-Jordan elimination, Gauss elimination and back substitution, and LU-decomposition and backsubstitution³.
Of the three Gaussian approaches to the solution of equations (Gauss elimination and back substitution, Gauss-Jordan elimination, and LU-decomposition and backsubstitution), Gauss-Jordan elimination is generally not the recommended method due to its increased operation count [48, 73] and the fact that all the 'right-hand-sides' of a set of equations need to be stored and manipulated at the same time as the elimination. However, Gauss-Jordan elimination has advantages which are specific to implementation on a matrix array. These include:
• Fixed-column Size
Gaussian and LU decomposition both 'triangularise' the matrix of linear equations, and then backsubstitute. One consequence of this is that the number of elements operated on in sequence is continually diminishing until only a single element exists in a row.
• Determination of the Inverse
3.4.1 Block Version
It is now well documented how to obtain a block version of Gaussian based elimination [1, 58], although a brief overview follows.
Gaussian elimination works due to three premises of elementary row operations that leave the matrix system unaffected as a whole. These operations are [48]:
• Interchange of two rows
• Multiplication of a row by a non-zero constant
• Addition of a constant multiple of one row to another
Thus, a system of linear equations S1 is row equivalent to another system S2 if S1 can be transformed into S2 using the elementary row operations listed above.
³Others exist, such as QR decomposition, which use more complicated operations than the linear addition and scaling operations in Gaussian elimination.







The row-oriented Gauss-Jordan elimination is shown in Figure 3.4.

Figure 3.4: Row Oriented Gauss-Jordan Elimination
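For reference, the row-oriented elimination of Figure 3.4 amounts to the following loop nest, shown here as a plain C++ sketch (our reconstruction of the figure's algorithm; pivoting and singularity checks are omitted for clarity).

// Reduces the n x n system [A | b] in place so that A becomes the identity
// and b becomes the solution x.  a is stored row-major: a[i*n + j].
#include <vector>

void gauss_jordan(std::vector<double>& a, std::vector<double>& b, int n) {
    for (int k = 0; k < n; ++k) {
        double pivInv = 1.0 / a[k * n + k];       // normalise the pivot row
        for (int j = 0; j < n; ++j) a[k * n + j] *= pivInv;
        b[k] *= pivInv;
        for (int i = 0; i < n; ++i) {             // zero the pivot column
            if (i == k) continue;
            double factor = a[i * n + k];
            for (int j = 0; j < n; ++j) a[i * n + j] -= factor * a[k * n + j];
            b[i] -= factor * b[k];
        }
    }
}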
To transform this into a block form, simply let the scalars a_ij become block matrices A_ij, where each A_ij is of dimensions p × p and p is the size of the matrix array. A premultiplication of p rows by a p × p block can be shown to be simply p elementary row operations, as shown below for the first row of just such an operation.

A_1,: = row1{BC}
     = (b11 c11 + b12 c21 + ... + b1p cp1,  b11 c12 + b12 c22 + ... + b1p cp2,  ...,
        b11 c1p + b12 c2p + ... + b1p cpp)   (3.7)

ie row one of A is a linear combination of the rows of C, with the scalar coefficients being elements from the matrix B.
3.4.2 Determining the Inverse of the Pivot Block
Although we now have an algorithm to calculate the solution of a set of linear equations in block form using only matrix outer products, there is still the requirement of calculating the inverse of the pivot blocks. Logically, Gauss-Jordan elimination could again be used for this calculation. However, as the systolic array would need to be unloaded after each vector of the matrix is applied, this would be very inefficient, and an alternative was sought.
Such an alternative is to use an iterative algorithm. One that is well known goes under the various names of Hotelling's Method, Schulz's Method and others, although the algorithm basically reduces to a matrix version of Newton's method of root finding [73, 70]. Newton's method applied to finding the inverse of a scalar is considered in more depth in Section 4.2.1. Suffice to say that it produces the result

Y_{2n+1} = 2Y_n − Y_n A Y_n   (3.8)

which is an iterative matrix equation for forming the matrix inverse Y of the square matrix A. Note that the precision of convergence n more than doubles at each iteration. In fact, the convergence is quadratic if Y_0 is sufficiently close to A^{-1}. One choice of the initial estimate Y_0 that satisfies the convergence condition is

Y_0 = diag(1/a11, 1/a22, ..., 1/app)   (3.9)

Appendix B contains an explanation as to why this choice converges quadratically to the correct solution of the inverse of A.
3.4.3 Implementation on the Proposed Memory Architecture
The memory architecture described in Chapter 6 is a multiport, bank-switched, cached system. The structure is repeated in Figure 3.1 for convenience.
Care must be taken when implementing an algorithm on the matrix engine to ensure that a minimum of stalls occur while waiting for data to cycle from output to input via the memory (either cache/main memory or the dual ported loop memories). The load/unload structure of the processing elements in the array, as defined by Marwood [58], is such that a matrix-matrix product can be initiated simultaneously with unloading the array. The square array is matched such that the time to load and apply a p × p matrix to each of the inputs and calculate the outer products is the same as the time to unload the previous product, assuming memory load/store times are equal.
The Gauss-Jordan elimination can be run in three parts, namely
• Invert pivot block
• Normalise pivot row
• Apply elementary row operations to zero all blocks on the same column as the pivot block
Implementing Block Inversion
Inverting a pivot block is a cyclical operation, and takes advantage of the dual-port RAMs that feed data directly from the output back to either of the inputs. Additionally, use is made of the fact that the transpose of a matrix can be extracted directly from the array, as shown in Figure 3.5.










Figure 3.5: Extracting a) Standard b) Transpose
Initially, the matrix A_ii (the pivot block) is applied to the left-hand side (coefficients a_jk in Figure 3.5) and the matrix Y_0 (the initial estimate of the inverse) is applied from the top (coefficients y_jk in Figure 3.5). Recall that the initial estimate Y_0 is generated 'on-the-fly'


























the calculation of Y_0 and the transpose extracted from the array, the resulting product that is extracted is

Output(= X_0) = −(A Y_0)^T   (3.10)

Noting the relationship

P^T Q^T = (QP)^T   (3.11)

and as Y_0 is a diagonal matrix, then Y_0^T = Y_0, so that cycling the initial product X_0 back to the left hand input and applying Y_0 again from the top input produces the result

(−A Y_0)^T Y_0 = (−Y_0 A Y_0)^T   (3.12)

If the matrices 2I and Y_0 are applied to the array while the previous product is being extracted, and the array is not cleared, the final product that results is the desired

Y_1 = 2Y_0 − Y_0 A Y_0   (3.13)

Subsequent iterations proceed in a similar manner, except that the current estimate Y_n for the next estimate Y_{2n+1} may no longer be diagonal, so the relationship Y_n^T = Y_n may not hold. However, as the estimate Y_n is now stored in the dual ported RAMs, and not calculated on the fly as it was for the first estimate, the transposed matrix can be extracted simply from the RAMs and the same iteration procedure applies. The complete equations for later iterations become:

Y_{2n+1} = Y_n (2I + ((−A Y_n)^T)^T)   (3.14)
         = 2Y_n − Y_n A Y_n   (3.15)

The sequence of n's will be

n = 0, 1, 3, 7, 15, 31, 63, 127, ...   (3.16)

Therefore, for 32-bit precision and setting n = 0 to start, five iterations are required. Once the final inverse has been determined, it is kept in the dual ported RAM for the normalisation phase.
Normalising the Pivot Row
This is a very simple part. Once the inverse of the pivot block is determined, each of the blocks in the pivot row is multiplied by the inverse of the pivot element. Thus, the pivot element becomes the identity matrix and does not need to be stored.
The one point that should be noted is the writing of data from the output. In traditional Gaussian elimination, storage is conserved by overwriting the original matrix with the updated version. However, as the memory architecture proposed does not allow simultaneous reads and writes to the same bank of memory, such a scheme would cause intolerable delays due to stalls. However, as memory is relatively cheap compared to improvements in processor performance, the updated matrix can be written to a new bank, which can then be swapped with the previous input bank for successive iterations.
Zero All Blocks in Pivot Column
This stage is the update phase, characterized by

Â_ji = A_ji − A_jk A_kk^{-1} A_ki
     = A_ji − A_jk Â_ki   (3.17)

Care needs to be taken to ensure that minimal memory conflicts occur during the update phase. Conflicts are likely in this algorithm and at this particular phase, as a block from the output must be fed into the input (namely the normalised block Â_ki). Therefore, the output bank must be available to the input. Bearing in mind that the previously updated block will be in the process of being written to the output bank when the next update is started, the normalised pivot row is not available. Therefore, to fill time, load the available block A_ji into the accumulators using the operation

Matrix Array = A_ji · I   (3.18)

By the time this operation is concluded, the previous write to the output bank should have concluded. Therefore, block Â_ki will be available, and can be applied to the top of the array. If this block is negated before it is applied, and the array is not cleared before applying the matrix product, the resulting matrix in the processing array will be

Matrix Array = A_ji − A_jk Â_ki   (3.19)

which is the desired update.
The ordering of the block update does have some significance, although only minimal losses occur if the wrong (non-optimal) ordering is used. Consider the row oriented form of Gauss-Jordan elimination provided in Figure 3.4. If the pivot row is completely normalized before any zeroing is performed, then one of the inputs to the processing array must come from the memory bank that is being used to store the result, which will require a large amount of bank switching. This is generally not a problem, although it will result in a number of stalls due to cache tagging. If, however, the access is by columns of blocks, as shown in Figure 3.6, then the normalised block from the pivot row will be available in one of the feed-back dual-ported RAMs, and bank switching will not be required until the second full iteration of the algorithm.








Figure 3.6: Column Oriented Gauss-Jordan Elimination
3.4.4 Inverting Rather Than Solving a Set of Matrix Equations
The case may exist where it is desirable to invert the given matrix rather than solve it for a given set of 'right hand sides'. This is a simple extension of the Gauss-Jordan elimination technique, and can be implemented easily on the matrix engine.
Instead of performing elementary row operations on the matrix A alone, perform the same row operations on the augmented matrix Ã = [A I]. As Gauss-Jordan elimination on AX = I implies that X = A^{-1}I = A^{-1}, elimination on the new augmented matrix implies that, while A reduces to the identity I, the identity in the augmented matrix, I_a, reduces to the desired inverse.
Of course, the entire augmented matrix need not be stored, as it can be created as needed. In the first iteration (on the first row of blocks), the only column of the identity matrix I_a that is affected is the first one. Therefore, as each column is eliminated from A, a new one is added onto the augmented matrix Ã.
The actual operations that are required for the normalisation and update phases can be simplified when operating on the matrix I_a due to the inherent 'ones' and 'zeroes' in this matrix. When normalising the pivot row 'i', the operation for the 'new' column (column 'i' in I_a) is

Î_a,ii = I_a,ii · A_ii^{-1} = A_ii^{-1}

which has already been calculated, so no extra work is required for the normalisation phase. The updating of all the other blocks in the same column is

Î_a,ji = I_a,ji − A_ji Î_a,ii
       = 0 − A_ji A_ii^{-1}
       = −A_ji A_ii^{-1}

which is a single multiplication.
Thus the creation of the inverse can be achieved with simplicity by some intelligent extra control in the input and output data controllers that maintains, rather than reduces, the size of the applied matrix.
3.4.5 Solving Sets of Equations That Don't Fit Into the Cache
With 1 MWord caches, Gauss-Jordan elimination can be used on matrices with up to approximately 1000 × 1000 elements without the need to load data more than once from main memory. However, systems of up to 10,000 × 10,000 or more can be solved efficiently, main memory size permitting⁵. It is important, however, to partition the problem into 'cache-sized chunks'.
The partitioning of the Gauss-Jordan algorithm to maximize cache reuse is similar to that used above to convert to block form (Section 3.4.1). Here, we will set the block size to accommodate a single block in the cache instead of in the processing array. Consider a cache size of 'C' MWords. The maximum size (square) matrix that will fit in the cache is one of order √C thousand elements. Therefore, divide the matrix into square blocks, each of size √C (= d) thousand elements on a side. For an M × N matrix 'E', the division is as shown in Figure 3.7.
The pivot element E11 is initially inverted using Gauss-Jordan elimination as described previously. This will create a d × d matrix which can be multiplied by each of the other blocks in the pivot row (ie ∀E1i).
Once the pivot row has been normalised, the pivot column can be zeroed by subtracting the matrix product of the normalised pivot row and the matrix in the pivot column from each row
⁵Storing a matrix of size 10,000 × 10,000 in main memory requires 100 MWords = 400 MBytes of DRAM for 32-bit words. As the array requires a minimum of twice the matrix storage requirements to operate efficiently, at least 200 MWords of memory would be required. The DRAM densities of the near future make these figures feasible, especially when considering other similar performance systems. However, matrices of size 5000 × 5000








Figure 3.7: Dividing Sets of Equations Into Blocks
E_i, in turn. These are all large matrix multiplication routines, and the matrix array will run at close to its peak speed.
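A sketch of the partitioning arithmetic may make the procedure concrete. The following C++ fragment (ours; the sizes and names are illustrative assumptions) chooses the block side d from the cache size and prints the schedule of pivot inversions, row normalisations and row updates.

#include <cmath>
#include <cstdio>

int main() {
    const double C = 1.0;                                   // cache size, MWords
    const int d = static_cast<int>(std::sqrt(C) * 1000.0);  // block side, elements
    const int M = 10000, N = 10000;                         // full matrix size
    const int mb = (M + d - 1) / d, nb = (N + d - 1) / d;   // blocks per side
    for (int k = 0; k < mb && k < nb; ++k) {
        std::printf("invert pivot block E(%d,%d) of order %d\n", k, k, d);
        for (int j = k + 1; j < nb; ++j)                    // normalise pivot row
            std::printf("  normalise E(%d,%d)\n", k, j);
        for (int i = 0; i < mb; ++i)                        // zero pivot column
            if (i != k) std::printf("  update block row %d\n", i);
    }
}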
3.4.6 Iteratively Improving a Set of Solutions
If a solution vector x̂ is obtained for a set of linear equations, it is possible that it could be different from the 'correct' solution x by the small amount δx, due to numerical error accumulation. What has in fact been solved is the system of linear equations, solved not for the vector of known coefficients b, but rather for a slightly different set of coefficients b̂ = b + δb, as shown in Equation 3.20.

A(x + δx) = b̂   (3.20)





and also Equation 3.22.

b̂ − b = A(x + δx) − Ax
      = Aδx   (3.22)

Substituting leads to Equation 3.23, for which the entire right hand side is known, as x + δx is the solution that was calculated but incorrect and needs improving, and b is given.

Aδx = A(x + δx) − b   (3.23)

Now Aδx can be solved for δx, which can then be subtracted from x + δx to return the corrected solution x. The numerical error in δx will be approximately the same as for x in terms of the number of significant digits that can reliably be considered 'correct'. However, as the most significant bits of δx (the more 'correct' ones) are a very good approximation to the amount by which x + δx differs from x, due to the difference in the orders of magnitude, the overall numerical error of the corrected solution will be reduced.
Two important points need to be considered when implementing iterative improvement on the matrix processing array. Firstly, if only a single solution needs to be improved, ie x is a vector rather than a matrix, matrix-vector operations will be required, which are inefficient on the matrix processing array. Therefore, where possible, the improvement should be used on sets of solutions, such that x is a matrix, preferably of the same dimension as the processing array.
Secondly, as is pointed out by Press et al. [73], if the solution vector is found using LU decomposition instead of Gauss-Jordan elimination, the LU-decomposed matrix can be reused for the solution to the left hand side of Equation 3.23, saving O(n³) operations. The matrix engine can be programmed to perform LU decomposition by simply saving the intermediate calculations rather than discarding them.
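As a plain sketch of Equation 3.23, one improvement pass looks as follows in C++. The solve() routine is an assumption standing in for whichever block solver is used (eg the Gauss-Jordan procedure sketched earlier).

#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<double>;          // n x n, row-major

Vec solve(const Mat& A, const Vec& rhs, int n);   // assumed available

Vec improve(const Mat& A, const Vec& b, const Vec& xhat, int n) {
    Vec r(n, 0.0);                        // r = A(x + dx) - b, the known RHS
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) r[i] += A[i * n + j] * xhat[j];
        r[i] -= b[i];
    }
    Vec dx = solve(A, r, n);              // A dx = r
    Vec x(n);
    for (int i = 0; i < n; ++i) x[i] = xhat[i] - dx[i];   // corrected solution
    return x;
}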
3.5 The Discrete Fourier Transform
The original aim of the matrix processor was to use it to compute fast digital signal processing algorithms. One of these is the Discrete Fourier Transform (DFT), which is of use when analysing the frequency spectrum of a given time-dependent signal. A signal defined in the time domain by the function h(t) has an equivalent function in the frequency domain, denoted H(f). The Fourier Transform is simply the method of converting from one domain to another. The Fourier transform equations are:

H(f) = ∫_{−∞}^{∞} h(t) e^{−2πift} dt   (3.24)

h(t) = ∫_{−∞}^{∞} H(f) e^{2πift} df   (3.25)

Equations 3.24 and 3.25 are often written using angular frequency ω instead of f. However, using the non-angular frequency f, there will be fewer factors of 2π to consider.
For 'N' consecutively sampled data points sampled at intervals of 'Δ' time units, the sampled values 'h_k' are such that:

h_k = h(kΔ),   k = 0, 1, 2, ..., N − 1   (3.26)




where Δ is the time between samples.
Then the Fourier transform integrals of Equations 3.24 and 3.25 can be discretized to the form:

X(k) = Σ_{n=0}^{N−1} x(n) e^{−2πikn/N}

x(n) = (1/N) Σ_{k=0}^{N−1} X(k) e^{2πikn/N}

where H(f_k) ≈ Δ X(k). Substituting W_N = e^{−2πi/N}, these discretized equations become:

X(k) = Σ_{n=0}^{N−1} x(n) W_N^{kn}   (3.28)

x(n) = (1/N) Σ_{k=0}^{N−1} X(k) W_N^{−kn}   (3.29)

which are the equations describing the Discrete Fourier Transform (DFT).
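Equation 3.28 is just a matrix-vector product X = Wx, which is exactly the form the matrix engine exploits. The direct O(N²) evaluation can be sketched as follows (an illustrative C++ rendering, not the thesis software):

#include <cmath>
#include <complex>
#include <vector>

std::vector<std::complex<double>> dft(const std::vector<std::complex<double>>& x) {
    const std::size_t N = x.size();
    const double twoPi = 6.283185307179586;
    std::vector<std::complex<double>> X(N);
    for (std::size_t k = 0; k < N; ++k) {
        std::complex<double> acc(0.0, 0.0);
        for (std::size_t n = 0; n < N; ++n) {
            double phase = -twoPi * double(k) * double(n) / double(N);
            // X(k) = sum_n x(n) W_N^{kn},  with W_N = e^{-2*pi*i/N}
            acc += x[n] * std::complex<double>(std::cos(phase), std::sin(phase));
        }
        X[k] = acc;
    }
    return X;
}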
Marwood derives the following two dimensional mapping of a one dimensional DFT (ie Equations 3.28 & 3.29) [63, 55, 58], using the mappings

n = (M1 n1 + M2 n2) mod N   (3.30)

k = (L1 k1 + L2 k2) mod N   (3.31)

that translate the one dimensional quantities n and k into (n1, n2) and (k1, k2) respectively, with the constraints 0 ≤ n1, k1 ≤ N1 − 1 and 0 ≤ n2, k2 ≤ N2 − 1, for some constants L1, L2, M1 & M2.
Then substituting these mappings into Equation 3.28, the DFT can be expressed in terms of the (reduced) two dimensional variables:

X(L1 k1 + L2 k2) = Σ_{n1=0}^{N1−1} Σ_{n2=0}^{N2−1} y(n1, n2) W_N^{nk}   (3.32)

where

W_N^{nk} = W_N^{M2 L2 n2 k2} W_N^{M1 L2 n1 k2} W_N^{M1 L1 n1 k1} W_N^{M2 L1 n2 k1}   (3.33)

If the function Y(k1, k2) is defined to be the mapping of X(L1 k1 + L2 k2) to two dimensions, then Equation 3.32 becomes

Y(k1, k2) = Σ_{n1=0}^{N1−1} Σ_{n2=0}^{N2−1} y(n1, n2) W_N^{nk}   (3.34)

Defining the mapping constants M1, M2, L1 and L2 appropriately, and for the case where N1 and N2 are relatively prime, the mapping becomes unique and the second and fourth factors in Equation 3.33 both equal one. Equation 3.34 becomes

Y(k1, k2) = Σ_{n1=0}^{N1−1} [ Σ_{n2=0}^{N2−1} y(n1, n2) W_N^{M2 L2 n2 k2} ] W_N^{M1 L1 n1 k1}   (3.35)


















or, as given by Burrus [12], in the matrix form

Y = W_{N1} y W_{N2}

where W_{N1} and W_{N2} are conventional Fourier matrices of dimension N1 × N1 and N2 × N2 respectively.
An important point to realize is that the two dimensional mapping can be extended to an arbitrary dimensioned mapping. In fact, Marwood and A.P. Clark show that it is necessary to use the higher dimensional mappings as the transform length increases if any performance advantage over conventional FFT algorithms is to be preserved [58, 60, 61]. From [58], the final result for a four dimensional case is
Y(k1, k2, k3, k4) = Σ_{n1=0}^{N1−1} [ Σ_{n2=0}^{N2−1} [ Σ_{n3=0}^{N3−1} [ Σ_{n4=0}^{N4−1} y(n1, n2, n3, n4) W_{N4}^{n4 k4} ] W_{N3}^{n3 k3} ] W_{N2}^{n2 k2} ] W_{N1}^{n1 k1}   (3.39)

and the coefficient matrix is again the matrix of Fourier coefficients such that

(W_{Ni})_{p,q} = e^{−2πi Mi Li pq / N}   (3.40)

where

N = Π_{l=1}^{4} N_l
From Equations 3.39 & 3.40, the extension to 'p' dimensions is obvious.












Marwood [58] shows the procedure of implementing a four dimensional prime factored DFT. The procedure is described in terms of the operations
1. X'_{i,j} = W_{N3} X_{i,j},  0 ≤ i < N1, 0 ≤ j < N2,
   where X_{i,j}(p,q) is derived from x(i,j,p,q) = i×M1 + j×M2 + p×M3 + q×M4.
2. X''_{i,j} = W_{N4} X'_{i,j},  0 ≤ i < N2, 0 ≤ j < N3,
   where X'_{i,j}(p,q) is derived from x(i,j,p,q) = i×M2 + j×M3 + p×M4 + q×M1.
3. X'''_{i,j} = W_{N1} X''_{i,j},  0 ≤ i < N3, 0 ≤ j < N4,
   where X''_{i,j}(p,q) is derived from x(i,j,p,q) = i×M3 + j×M4 + p×M1 + q×M2.
4. X''''_{i,j} = W_{N2} X'''_{i,j},  0 ≤ i < N4, 0 ≤ j < N1,
   where X'''_{i,j}(p,q) is derived from x(i,j,p,q) = i×M4 + j×M1 + p×M2 + q×M3.
A graphical representation of a multidimensional prime factor DFT can easily be shown in up to three dimensions, as in Figure 3.8. This figure shows three coefficient matrices, W1, W2 & W3, each multiplied by the 'cube' of data 'X' in turn. Due to matrix associativity, the order in which the matrix products are calculated is not important.
From the diagram, it can be seen that the number of products and the sizes of such products are:
1. N1 iterations of (N2 × N2) × (N2 × N3) & (N2 × N3) × (N3 × N3) products
2. N2 iterations of (N3 × N3) × (N3 × N1) & (N3 × N1) × (N1 × N1) products
3. N3 iterations of (N1 × N1) × (N1 × N2) & (N1 × N2) × (N2 × N2) products
This can be extended to one or more dimensions of two dimensional matrix products, which can each be executed in turn. Some performance estimates are given in Section 7.3 using this procedure. These show that the matrix engine performs very creditably on medium and large








Figure 3.8: 3 Dimensional Representation of Prime Factor DFT
3.6 The Kalman Filter
Often found in control systems, signal processing and communications, the Kalman filter is a very efficient method of providing minimum variance estimates from noisy measurements [45, 64, 31]. The Kalman filter is, however, of computational order O(n³), which provides quite a bottleneck if the filter is required for real-time broad-band applications.
Approximately ten years ago, the Kalman filter was used to estimate state vectors of size ten to twenty elements. As computational power has increased and powerful microprocessors and embedded systems have become cheaper, the number of elements in a state vector has increased to approximately 50 to 70, which fully utilises the available power. However, high end control systems have always demanded computational power one to two orders of magnitude larger than the norm, capable of calculating the Kalman filter of state vectors three to five times larger than is common at the time. For example, the Kalman filter implemented on the Voyager I & II spacecraft used 67 state variables with 3500 data points. The computer system for the filter was designed in 1977 [14]. It seems that, no matter how much computational power is available, an application will be found that consumes the full amount, and possibly asks for more [88]. The aim of the implementation described here is therefore not to provide a competitor to existing systems, but to open the scope of the Kalman filter to include systems that were not implementable previously.
There are several forms of the Kalman filter. The form presented here is that used by Gaston & Irwin [31]. They describe the Kalman filter in terms of the equations:

x(k+1) = A(k)x(k) + B(k)u(k) + w(k)   (3.42)

y(k) = C(k)x(k) + v(k)   (3.43)

where x(k) is the state vector of the system, of dimension (n × 1), y(k) is the (m × 1) vector of measured output variables such that m ≤ n, and u(k) is the (p × 1) control vector. The (m × 1) v(k) and (n × 1) w(k) vectors represent white noise with assumed zero mean and known covariance. The matrices A(k), B(k) and C(k) are assumed to be known, and are of dimensions (n × n), (n × p) & (m × n) respectively.
The aim of the Kalman filter is to estimate the state of a system given a series of measurements, using the error covariances

P(k+1|k) = E[(x(k+1) − x̂(k+1|k)) (x(k+1) − x̂(k+1|k))^T]   (3.44)

P(k|k) = E[(x(k) − x̂(k|k)) (x(k) − x̂(k|k))^T]   (3.45)

where P is the error covariance matrix and the bracket notation '(s|t)' implies the estimate of the state at time 's' given measurements up to time 't', with s ≥ t.
Papadourakis & Taylor [31] derive the filter equations shown in Equations 3.46 & 3.47 for a rectangular systolic array configured for matrix multiplication.

x̂(k+1|k) = A(k)x̂(k|k−1) + A(k)K(k)[y(k) − C(k)x̂(k|k−1)]   (3.46)

P(k+1|k) = A(k)[I − K(k)C(k)]P(k|k−1)A^T(k) + W(k)   (3.47)

where

K(k) = P(k|k−1)C^T(k) [C(k)P(k|k−1)C^T(k) + V(k)]^{-1}   (3.48)
The covariances W(k) and V(k) may be assumed to be zero, although they are left in for completeness.
To implement the Kalman filter of Equations 3.46, 3.47 & 3.48, care must be taken to arrange the order of multiplications correctly for maximum efficiency. As there are a relatively large number of matrix multiplications per iteration, and the number of elements in the state vector will result in a small number of partitions, the ordering of the calculations should be considered in two separate cases, those of:
1. The number of state variables and measured variables are both smaller than the dimension of the processing array.
2. Either the number of state variables or the number of measured variables or both are larger than the dimension of the processing array.
For the first case, the ordering of computation, together with the labelling of intermediate calculations, is shown in Table 3.1. The table shows the procedure for calculating the Kalman filter directly, by showing the required inputs, the resultant calculation and a temporary name for this calculation, and whether to unload the calculation at the completion or to leave it in the array for accumulation. This ordering, assuming a bank-swapping memory architecture, requires only two delay slots, one before and one after the calculation of the inverse of C(k)P(k|k−1)C^T(k) + V(k). These are required as the following calculation relies on the result of the current calculation. While these delays are not strictly necessary, as a future calculation can be brought forward to fill the delay, they are retained here to maintain coherence by implementing the procedure directly.
If the dimensions of the state or observation vectors are larger than that of the processing array, the computations can be partitioned, which will help 'hide' some of the unload operations while other partitions are calculated. Table 3.2 shows an example ordering of the partitioned form. The subscripts following a matrix indicate the partition of the matrix, ie X(k)_ij implies the i-th row partition and the j-th column partition. If a colon is included, either a complete row or a complete column partition is implied, ie X(k)_i,: is all columns of the i-th row partition and X(k)_:,j is all rows of the j-th column partition.
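One step of Equations 3.46 to 3.48 can be sketched in C++ as below. This is an illustrative rendering only: the mul() and inverse() helpers are assumptions (they could be adapted from the sketches earlier in this chapter), and the covariance update follows the form of Equation 3.47.

#include <vector>
using Mat = std::vector<double>;                            // row-major storage

// Assumed helpers: (r x k)(k x c) product, and an n x n inverse
Mat mul(const Mat& A, int r, int k, const Mat& B, int c);
Mat inverse(const Mat& A, int n);

// x = x(k|k-1) (n x 1), P = P(k|k-1) (n x n); y (m x 1); known A (n x n),
// C (m x n), V (m x m), W (n x n).  Overwrites x and P with the predictions.
void kalman_step(Mat& x, Mat& P, const Mat& A, const Mat& C,
                 const Mat& V, const Mat& W, const Mat& y, int n, int m) {
    Mat Ct(n * m);                                          // C transposed
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) Ct[j * m + i] = C[i * n + j];
    Mat PCt = mul(P, n, n, Ct, m);
    Mat S = mul(C, m, n, PCt, m);                           // C P C^T + V
    for (int i = 0; i < m * m; ++i) S[i] += V[i];
    Mat K = mul(PCt, n, m, inverse(S, m), m);               // Equation 3.48

    Mat e = mul(C, m, n, x, 1);                             // e = y - C x
    for (int i = 0; i < m; ++i) e[i] = y[i] - e[i];
    Mat xKe = mul(K, n, m, e, 1);
    for (int i = 0; i < n; ++i) xKe[i] += x[i];             // x + K e
    x = mul(A, n, n, xKe, 1);                               // Equation 3.46

    Mat CP = mul(C, m, n, P, n);                            // (I - K C) P
    Mat KCP = mul(K, n, m, CP, n);
    for (int i = 0; i < n * n; ++i) P[i] -= KCP[i];
    Mat At(n * n);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) At[j * n + i] = A[i * n + j];
    P = mul(mul(A, n, n, P, n), n, n, At, n);               // Equation 3.47
    for (int i = 0; i < n * n; ++i) P[i] += W[i];
}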






































































Table 3.2: Ordering when Number of State Vectors is Large (columns: Input 1, Input 2, Calculation)
The main inefficiency of directly implementing the routine on the matrix engine is Equation 3.46. Using a constant bandwidth array, all but one of the columns of the array would be fed with zeros, wasting a great deal of the potential performance.
Example performance estimates for simple systems are derived in Chapter 7. These suggest that the systolic matrix processing array will be very useful for calculations in large Kalman filter systems, and even possibly extend the range of applications for which Kalman filters are used to include hitherto unconsidered problems.
3.7 Conclusion
From the analysis of these few algorithms, it can be seen that a relatively simple processing array can perform more complex algorithms by using clever interface hardware. This combination will increase the versatility of the array while decreasing the cost compared to other systems of similar performance. The range of applications that the array can be used for extends into fields such as aerodynamics and fluid dynamics, embedded high speed signal processing and control,




As will be shown in Chapter 5, the address generator in the data controller is the main bottleneck of system performance. The address generator is implemented using an Extended Marwood Difference Engine, which provides modulo arithmetic using only the addition and subtraction operators [58]. Therefore, any improvement in the speed of addition and subtraction will result in a faster difference engine/address generator, and ultimately better system performance.
To maximise performance of this critical path without resorting to exotic technologies, fundamental arithmetic was reviewed, and some solutions proposed. In this chapter, we will explore the possibility of using non-conventional number systems to reduce the computational delay of an addition/subtraction operation. Specifically, a signed-digit arithmetic system, in which digits may take the values of not only zero and one but also negative one, is presented for use in the address generator.
In the second part of the chapter, division and multiplication are examined. The previous chapter on algorithms indicated the need for a divider in the data-input path to the array for the efficient implementation of Gauss-Jordan elimination. The data controllers, which supply data to the processing array, are the logical position for the divider, as the input data must pass through them on its way to the processing array. An iterative divider that provides a good speed/area trade-off while still performing fast enough not to significantly affect the overall performance of the system is also presented, together with a signed-digit multiplier that is used in the divider.
4.1 Signed-Digit, or Carry-Propagation Free, Arithmetic
Although the concepts of signed digit arithmetic are not new [3, 4, 91], it is only more recently that they have become a viable alternative, due to the availability of VLSI circuits in which speed and not size is the major concern.
In a 'conventional' number system, such as two's complement binary or IEEE 754 floating point, each representable number has a unique representation, thus maximising the range of numbers that can be represented in a given number of bits. The redundant binary representation allows more than one representation of most numbers in the number set's range by using the 'digit set' of {−1, 0, 1} (−1 often represented by 1̄). Therefore, the representable range is not optimal for the number of bits required, although, as we shall see, there are other benefits.
Figure 4.1 shows an idealised version of a single digit cell in a signed-digit adder. The figure is idealised because it has been broken into three distinct parts, or stages, to demonstrate the carry-propagation path, whereas in practice, all the parts are merged to reduce hardware.















Figure 4.1: Sign-Digit Addition Cell.
Consider two numbers, X and Y (say), applied to the two inputs of the i-th digit cell. As X and Y are both signed digits (ie each in the range {−1, 0, 1}), the sum of the two is in the range {−2, −1, 0, 1, 2}. It is obvious that if the answer is '2', then the '+2 out' signal is flagged true. But what about if the answer is '1'? Due to the availability of negative numbers, '1' could be represented as either '+2 out' plus '−1', or '1'. This is due to the redundancy in the system.
The idea, then, is to minimise the set of possible outputs from each part of the digit cell. Therefore, in the case mentioned above of X + Y = 1, the choice would be to flag '+2 out', as this reduces the set of possible outputs to the next stage to {−2, −1, 0}.
Similarly for the second stage: the input from the first stage, in the range {−2, −1, 0}, added to the '+2 out' from the previous digit, translated to the range {0, 1} due to the increase in significance, results in the output being in the range {−2, −1, 0, 1}. If the '−2 out' flag is set for the cases of −2 and −1, the output from the second stage will be in the range {0, 1}.
The third and final stage must output a number in the range {−1, 0, 1}, which is obviously the case. The input to the final stage from the second stage is in the range {0, 1}, and the input from the '−2 out' of the previous digit is either '−1' or '0'. Therefore, the output from the final stage is in the range {−1, 0, 1}, another signed-digit number.
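The three-stage behaviour can be modelled in software. The following C++ sketch (a behavioural model of Figure 4.1 written by us, not the VLSI cell) applies exactly the flagging rules described above; no carry ever travels more than two digit positions.

// Digits are little-endian ints in {-1, 0, +1}; inputs have equal length.
#include <cstddef>
#include <vector>

std::vector<int> sd_add(const std::vector<int>& x, const std::vector<int>& y) {
    std::size_t n = x.size();
    std::vector<int> r1(n), c1(n), r2(n), c2(n), out(n + 1);
    for (std::size_t i = 0; i < n; ++i) {     // stage 1: sum in {-2..2}
        int t = x[i] + y[i];
        c1[i] = (t >= 1);                     // flag '+2 out' for +1 and +2
        r1[i] = t - 2 * c1[i];                // residue in {-2,-1,0}
    }
    for (std::size_t i = 0; i < n; ++i) {     // stage 2: add '+2 out' (now +1)
        int u = r1[i] + (i ? c1[i - 1] : 0);  // in {-2,-1,0,1}
        c2[i] = (u <= -1);                    // flag '-2 out' for -2 and -1
        r2[i] = u + 2 * c2[i];                // residue in {0,1}
    }
    for (std::size_t i = 0; i <= n; ++i) {    // stage 3: add '-2 out' (now -1)
        int a = (i < n) ? r2[i] : c1[n - 1];  // top digit takes final '+2 out'
        int b = i ? -c2[i - 1] : 0;
        out[i] = a + b;                       // final digit in {-1,0,1}
    }
    return out;
}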
4.1.1 Why SD Arithmetic
The obvious question now is: why go to all the trouble of conversion to and from a redundant system, and also pay the penalty of the extra area requirements? As hinted above, the answer is that a redundant system is 'carry-propagation free'. While each cell is not entirely independent of other cells, the dependency can be traced back exactly two cells.
To see this, it is easiest to work back from the output stage. Figure 4.2 shows the propagation paths for the i-th cell in an adder.
The final output is dependent on the output from the second stage (i₂) and the '−1 in' signal from the previous cell ((i−1)₂). The '−1 in' signal from stage (i−1)₂ is dependent on the output from the first stage of cell (i−1) (ie stage (i−1)₁) and the '+1 in' signal from stage (i−2)₁. As the '+1 in' for stage (i−1)₂ is actually '+2 out' from stage (i−2)₁, which is








Figure 4.2: Sign-Digit Propagation Path.
dependent only on the two (signed-digit) inputs, the longest carry propagation path is from the two inputs of stage (i−2)₁ to the output of stage i, ie two cells in length.
Now the two main advantages of Signed Digit Arithmetic become obvious. These are:
• faster calculation due to the reduced carry-propagation path. This path is now 2 cells long rather than 'n' cells for a carry-propagate adder. Although alternative schemes such as Carry Look-Ahead and Carry-Select have reduced computation times (log(n) and √n respectively), they both have a greatly increased area over a standard ripple carry, and a propagation time still dependent on the size of the data.
• as the calculation time is fixed at the propagation time through two cells (independent of operand size), it is easier to match stages in a pipelined system with varying adder lengths. This is the case for a multiplier, in which the second stage adder will be longer than the first, the third stage longer than the second, etc.
The third advantage is a bit more esoteric. As IEEE floating point expresses the mantissa as an unsigned binary number with a sign bit, conversion to signed-digit form is simply a matter of making the magnitude bits of all digits equal to the magnitude of the corresponding mantissa bit, and the sign bit for each digit equal to the sign bit of the floating point number. Thus no addition is required, as is the case when IEEE floating point is converted to two's complement.
4.1.2 Sign Magnitude Specifics
As there are three distinct values in the number set {−1, 0, 1}, two bits will be required to encode the three possibilities for each digit. Generally, there can be held to be two signed-digit representations of this number set. These are:
• Signed-Bit Form
The name Signed-Digit is often used here instead of Signed-Bit, but can often be confused with the concept Signed-Digit rather than the implementation Signed-Digit. In this notation, each bit represents a number of magnitude one or zero, one representing negative one and the other representing positive one. Thus the name refers to the fact that each bit represents either a positive or a negative unit. Then the sum of the combination of the












The combination of the two bits, labelled n and p for negative one and positive one respectively, can be summarised in a table as shown below (Table 4.1).

  n p | value
  0 0 |   0
  0 1 |  +1
  1 0 |  −1
  1 1 |   0

Table 4.1: Table of SD representation
• Sign-Magnitude Form
Similar in concept to the familiar sign magnitude form of the IEEE floating point standard, this approach provides a sign bit for every bit in a number. The two bits representing each digit can therefore be individually referred to as the sign bit, s, and the magnitude bit, m.















  s m | value
  0 0 |  +0
  0 1 |  +1
  1 0 |  −0
  1 1 |  −1

Table 4.2: Table of SM representation
Note that the sign-magnitude notation implicitly includes a sign for every number, including zero. Therefore, there are distinct positive and negative zeros, unlike the signed-digit form.
4.1.3 Conversion To and From Signed Digit and Two's Complement
The main disadvantage of any new number format is that converting existing formats to the new one, and back again, can be complex. The latency of such a conversion done 'on-the-fly' is often too large to make a new format practical². However, one of the main advantages of the redundant arithmetic described above is that conventional binary notation is a subset of the redundant representation, making conversion fast and simple. Note that the description here is limited to two's complement conversion, although floating point conversion is even simpler³.
²This is generally the case for formats like residue arithmetic [92].
³As the IEEE floating point standard is in sign magnitude form [32], all that is needed to convert to signed-digit form is to assign the sign bit of the IEEE format number to all the digits in the signed-digit representation. Conversion the other way requires that the sign of the signed-digit number is determined, and assigned to the unsigned version of the signed-digit number.
Converting From Two's Complement Notation
The conversion to a signed digit number from a positive two's complement number is simply a matter of assigning each of the magnitude bits in the signed digit representation the value of the corresponding bit in the two's complement representation. The sign bits of all digits are set to zero, as shown in Figure 4.3.
Figure 4.3: Converting Positive 2's Complement to Signed Digit.
Conversion of a negative two's complement number to signed digit involves converting to a positive number and setting all the sign bits in each digit to one. To convert a negative two's complement number to a positive one, all the bits are inverted, and one is added. The addition of one is often done 'on-the-fly' by applying a one to the 'carry-in' of the least significant adder. However, as the signed digit number is actually a negative number, negative one must be applied instead.









Figure 4.4: Converting Negative 2's Complement to Signed Digit.
Therefore, all that is required for the conversion from two's complement to signed digit notation is a series of inverters to complement a negative two's complement number, and several multiplexers, controlled by the most significant bit of the two's complement number, to select between the original and complemented forms.




Converting To Two's Complement Notation
This is a little more tricky than the other way round (Section 4.1.3), as each digit in a signed digit number can be either negative or positive. One possible approach is to start at the most significant bit of the signed digit number, and add or subtract each successive lower significant bit in turn. Another is to convert the signed digit number into a 'positive' number and a 'negative' number, and then perform a propagate addition. The latter arrangement appears to hold more promise, as configurations such as Carry-Select and Carry Look-Ahead adders can be used to speed up the propagate addition, whereas the former is an inherently serial operation.
Figure 4.5: Converting Signed Digit to 2's Complement.
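Both conversions can be sketched in a few lines of C++. This is an illustrative software model on digit vectors (names are ours); the final subtraction pos − neg stands in for the single propagate addition mentioned above.

#include <cstdint>
#include <vector>

// n-bit two's complement value -> n signed digits in {-1,0,1}.
std::vector<int> tc_to_sd(std::int64_t v, int n) {
    bool neg = v < 0;
    // For negative inputs, 'invert and add one' is done here by integer negation.
    std::uint64_t mag = neg ? static_cast<std::uint64_t>(-v)
                            : static_cast<std::uint64_t>(v);
    std::vector<int> d(n, 0);
    for (int i = 0; i < n; ++i)
        if ((mag >> i) & 1) d[i] = neg ? -1 : 1;   // one sign for every digit
    return d;
}

// Signed digits -> integer, via separate 'positive' and 'negative' numbers.
std::int64_t sd_to_int(const std::vector<int>& d) {
    std::int64_t pos = 0, negv = 0;
    for (int i = static_cast<int>(d.size()) - 1; i >= 0; --i) {
        pos  = 2 * pos  + (d[i] > 0);
        negv = 2 * negv + (d[i] < 0);
    }
    return pos - negv;                             // one propagate addition
}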
4.1.4 Signed Digit Implementation
In the practical implementation, there are many simplifications that are available by combining all three stages of the adder. The only requirement, then, is to recall that the 'pos-out' signal depends only on the two signed magnitude inputs, the 'neg-out' signal depends only on the two signed digit inputs and the 'pos-in' input (pos-out from the previous stage), and the signed digit









































































"All the elements are essential for this Karnaugh Map; the surrounding circles are
rernoved for claritv
Due to the redundancy available, there are several simultaneous degrees of freedom open to the designer. In the Karnaugh maps above, 'essential' combinations are circled, except for the map in which every combination is essential for maintaining the requirements of Signed Digit (propagation free) arithmetic. For example, if either (S1, M1) or (S0, M0) or both equals positive one and neither equals negative one, then the posout signal must be asserted. Additionally, once posout has been asserted, negout can be set so that the number '+1' is represented either as +2 − 2 + 1 (posout and negout with a digit of +1) or as +2 − 1 (posout with a digit of −1). Many other such possibilities exist. The groupings in Equations 4.1 to 4.7 were chosen as they appeared to minimise the logic equations. However, due to so many options being available, a completely automated process is not possible. The program 'ESPRESSO'⁴






Mout = ((M1 ⊕ M0) ⊕ posin) ⊕ negin   (4.6)














The logic equations (4.1 to 4.7) above have all been converted from AND/OR form to NAND/NAND form. Also, as the adders will be used in the address generator, which is a critical bottleneck, each term is calculated with its conjugate where the conjugate is required. This is only an inverter in the cases of Mout and negout, but the conjugate of posout is computed simultaneously with posout. The other point to note is the order of evaluation of the term Mout. Mout is asserted if an odd number of the four inputs to a cell are asserted. A more balanced exclusive-or combination to achieve this function is

Mout = ((M1 ⊕ M0) ⊕ (posin ⊕ negin))

However, the terms M1 & M0 are available initially, posin from the previous stage is available only after two gate delays, and negin is only available a further two gate delays after posin. Hence, the ordering in Equation 4.6 uses the terms as they become available, resulting in a better balance and fewer spikes.
End Conditions
Although a single n-digit adder can be composed of n concatenated single digit adder cells, there are conditions for the first and last digits of the adders that simplify the requirements of those cells. For example, the first cell in a signed digit adder does not need the 'pos-in' and 'neg-in' propagate signals, and so these can be eliminated from the logic equations. In such a case, one of the several sets of possible Karnaugh maps for the output is:
⁴ESPRESSO is a Boolean logic minimization programme from the Octtools design suite, written at the University of California, Berkeley.



















Sout = M1   (4.10)
Mout = M0 ⊕ M1   (4.11)
If the number being applied to a signed-digit adder is a two's complement number, then the conversion to signed digit notation requires that there is a 'neg-in' signal, but a 'pos-in' signal is not required (see Section 4.1.3). This is the case for the adders in the address generators, which use two's complement notation for the 'delta's (see Chapter 5). A second requirement for the adders in the address generators is that the number 'zero' has an exclusive positive representation, ie the number '±0' must be represented as '+0'. This constraint, added to the lack of a 'pos-in' signal, produces the following set of possible Karnaugh maps.





Sout = negin · (M1 ⊕ M0)   (4.14)
Mout = (M1 ⊕ M0) ⊕ negin   (4.15)
Similarly, the last digit cell in the adder is not required to generate the 'pos-out' and 'neg-out' signals, although these may be of use if overflow detection is required. However, in some cases, such as the multiplier, there may be only one input to an adder cell, plus the propagate signals. The simplifications thus made are shown in Figure 4.8 and in Equations 4.16 to 4.19.
Figure 4.6: Karnaugh Maps for First Digit Cell of an Adder





























































































Figure 4.7: Karnaugh Maps for First Digit Cell of an Adder
Figure 4.8: Karnaugh Maps for Last Digit Cell of an Adder with Single Input





posout = posin · Min   (4.16)
Sout = negin   (4.18)
Mout = (Min ⊕ posin) ⊕ negin   (4.19)




Additionally, the final propagated outputs, 'pos-out' and 'neg-out', can be converted from signed bit form to sign magnitude form, and applied as inputs to another adder. Rather than generating pos-out and neg-out, the sign magnitude outputs, 'S2' and 'M2', can be generated directly. The logic equations for the new outputs are

S2out = p̄osin   (4.20)

with Sout & Mout the same as in Equations 4.18 & 4.19.
4.1.5 VLSI Layout and Implementation
Three separate signed digit adders were ultimately implemented, two of which incorporated latches into the adder to facilitate cyclical addition/accumulations⁵ and pipelining⁶, and one of which incorporated multiplexers after the latches⁷. The simplest contained no latches or multiplexers, and is used in a non-pipelined multiplier⁸.
The unlatched adder, the simplest of all, is shown in Figure 4.9. The input sign bits are applied to the top of the cell, and do not require a bus line as there are only three connections between the two inputs. The input magnitude bits occupy the bottom four internal bus lines, which can be connected to either by running metal2 down through the cell or by the side of the cell. The 'pos-in' and 'neg-in' signals are applied at the left hand side, and the generated 'pos-out' and 'neg-out' signals are output on the right hand side. The signed digit output number is generated at the bottom of the cell.
The overall size of the cell when implemented in 0.7µm es2 CMOS is approximately 0.089mm × 0.103mm = 0.0092mm². The cell was simulated using HSpice for all possible input combinations of the propagated and applied signals, with the propagated input signals being delayed by the same amount as was determined for the output propagated signals to be generated. The adder was found to function correctly with a cycle time of just under five nanoseconds. The HSpice simulation results are shown in Figure 4.10. The es2 process is a double-metal, single polysilicon process with a 0.7 micron feature size. The simulations were run using 'typical' parameters throughout, with a five volt supply and an assumed temperature of twenty-five degrees Celsius. The es2 process is from European Silicon Structures, and is available to research institutions through TIMA-CMP, in France.
The latched adder is shown in Figure 4.12. The latches are standard pass-transistor and two-inverter designs, as shown in Figure 4.11. The input data is latched into the adder, and then applied to internal buses from which the outputs are derived. The data flow is again left
⁵as needed in Chapter 5 for address generation
⁶for a multiplier
⁷for selection of the correct address after the modulo operation
⁸required for the byte divider of Section 4.2
Figure 4.9: Unlatched Signed-Digit Adder Cell

















































Figure 4.10: HSpice Plots for Unlatched Adder Cell
to right for the propagate signals (pos-in, pos-out, neg-in, neg-out), and top to bottom for the generated signals (s-in, s-out, m-in, m-out). For the 0.7µm es2 CMOS process, the size of a latched adder cell is approximately 0.097mm × 0.24mm = 0.023mm², which is approximately three times larger than the unlatched version. Note, however, that the latched version contains much more unused area that would be eliminated during a compaction phase.
Figure 4.11: Layout of Latch Cell
Figure 4.12: Latched Signed-Digit Adder Cell
The HSpice simulations of the latched adder indicated that the functionality was correct, and that a new addition could commence every five nanoseconds. The simulation results are shown in Figure 4.13.
The third adder cell was one designed specifically for the address generators described in Chapter 5. These include a pair of multiplexers at one of the inputs, after the latch. The reason for this is to allow the multiplexer control signal to settle before being used, and to reduce the driving requirements on the output drivers of the adder cell.
The layout of the third adder is shown in Figure 4.14. The adder was also tested using HSpice, and found to operate correctly.






































Figure 4.13: HSpice Simulation for Latched Adder Cell
Figure 4.14: Latched Adder Cell With Multiplexer
VLSI Implementation of End Cells
The first cells and final cells described above in Section 4.1.4 were all laid out in 0.7µm es2 CMOS. The unlatched beginning cell is shown in Figure 4.15, and the HSpice simulation results are shown in Figure 4.16.
Figure 4.15: Layout of Unlatched Beginning Cell





































Figure 4.16: HSpice Simulation Results for Unlatched Beginning Cell
The cell is approximately 0.089mm × 0.061mm = 0.0054mm² in area. All the output signals are available less than two nanoseconds after the inputs are applied.
If the latches are added to the cell, a significant extra area is required. In this case, the latches consume approximately 31 percent of the total area. However, as the cell will be only a small part of a complete adder, the excess area is not a major concern. The layout of the latched beginning cell is shown in Figure 4.17.
When the multiplexers and the facility to apply a two's complement number are added for the cells used in the address generator, the area penalty of adding latches is mitigated. The
Figure 4.17: Layout of Latched Beginning Cell
VLSI layout of this beginning cell is shown in Figure 4.18.
Figure 4.18: Layout of Latch & Multiplexed Input Beginning Cell
The area consumed by the cell is 0.10mm × 0.24mm = 0.024mm², which is approximately 30 percent more than for the latched beginning cell, and about four times the area of the unlatched cell.
The end cells, denoted sm-end and sm-end2 depending on whether the propagated outputs are plus two and minus two ({−2, +2}) or a sign and a magnitude bit ({±, 2}), are shown in Figures 4.19 a) & b) respectively for the unlatched versions. Both cells consume an area of 0.0892mm × 0.0785mm = 0.00700mm² when implemented in 0.7µm es2 CMOS.
The end cells were both simulated using HSPICE, and were found to operate correctly. The simulation results are shown in Figures 4.20 a) & b).
As can be seen from these plots, the end cells all settle well before five nanoseconds, even for a propagated input delay of 1.5 nsec.
To pipeline the multiplier, the latched version of the adder was modified to produce a latched
Figure 4.19: a) Unlatched End Cell With Standard Propagated Outputs
Figure 4.19: b) Unlatched End Cell With Sign Magnitude Propagated Outputs


































Figure 4.20: a) Unlatched End Cell With Standard Propagated Outputs Simulation
Results


























Figure 4.20: b) Unlatched End Cell With Sign Magnitude Propagated Outputs Simulation Results
end cell. The VLSI layout of the latched end cell with standard propagate output signals is shown in Figure 4.21 a), and with sign magnitude outputs in Figure 4.21 b). When implemented in 0.7µm es2 CMOS, they consume 0.0972mm × 0.1196mm = 0.0116mm² and 0.0972mm × 0.1288mm = 0.0125mm² of silicon respectively.
Figure 4.21: a) Latched End Cell With Standard Propagated Outputs
Figure 4.21: b) Latched End Cell With Sign Magnitude Propagated Outputs
The HSpice simulations of the latched end cells are shown in Figures 4.22 a) & b). These simulations show that the cycle time between latched operations is less than five nanoseconds, as desired.
4.2 Byte Divider
As mentioned in Chapter 3 on Algorithms, if Gaussian elimination is to be run on the matrix engine, some form of divider is required to find the inverse of the pivot elements. The requirement here is for an area-efficient divider which is relatively fast. In effect, if the matrix engine is to

































Figure 4.22: a) Latched End Cell With Standard Propagated Outputs Simulation Results















Figure 4.22: b) Latched End Cell With Sign Magnitude Propagated Outputs Simulation Results
the matrix array) is required. For a typical system, this may be in the order of 150 to 400 nsec, depending on array size, technology, etc.
A scheme that produces a divider that is both relatively small and also fast is one based on higher radix division. This is, in turn, based on the Newton-Raphson division algorithm, which has been well documented [92, 32, 95]. A brief introduction to Newton-Raphson division is provided in the following section, including starting and error considerations, followed by the conversion of this approach to higher radix division.
4.2.1 Newton-Raphson Iterative Divider
The concept behind Newton's iteration for root finding is that a series of iterations of a basic formula will produce an approximation to the true root of a function at each iteration that is a better approximation than the last. If the exact solution to the root is at x, such that f(x) = 0, and the i-th approximation to the root is x_i, then the next approximation to the root will be the old approximation plus a difference term.

x_{i+1} = x_i + Δx_i   (4.22)

The slope of a line drawn from the function at the initial estimate to the function at the subsequent estimate will be:

m_i ≈ −f(x_i)/Δx_i   (4.23)

where the negative sign is due to the function at the subsequent estimate being much closer to zero than at the initial estimate.
If the slope m_i is made to be the tangent of the function at the approximation for x, such that

m_i = f'(x_i)   (4.24)

then the equation for the next approximation to x becomes

x_{i+1} = x_i − f(x_i)/f'(x_i)   (4.25)

Equation 4.25 is the Newton-Raphson formula for root finding.
To apply Equation 4.25 to the case of division, it is first necessary to formulate the problem in terms of finding the root of a function. The equation

f(x) = 1/x − b   (4.26)

has a root at the point x = 1/b, and so can be used to find the inverse of the number represented by 'b'. Noting that f'(x) = −1/x², Equation 4.25 can be formulated as

x_{i+1} = x_i − f(x_i)/f'(x_i) = x_i(2 − x_i b)   (4.27)

Then Equation 4.27 successively approximates the inverse of the number 'b' to the actual inverse 1/b.
For the sake of consistency with other texts, it is often the case that the 'x' variable in Equation 4.27 is replaced by 'q' for the specific case of the division formulation of Newton's iteration, and the exact inverse of b is denoted as 'Q'. The new representation is:

q_{i+1} = q_i(2 − q_i b)   (4.28)
Division of Floating Point Numbers
To apply Equation 4.28 to a floating point number, it is necessary to consider the range of the divisor and the quotient. The number will be composed of a mantissa field, a sign bit, a 'hidden' leading one and an exponent field.
As the mantissa is a fractional number, the mantissa and the hidden bit together comprise a number in the range 1 ≤ 1.f < 2, where the term 'f' represents the mantissa. The inverse, therefore, is in the range 1 ≥ 1/(1.f) > 1/2, ie 0.5 < inv ≤ 1. The new exponent, of course, is simply the negative of the old exponent.
Self-Correction and Starting of Newton's lteration
In the past, a Newton's Iteration divider often obtained its initial guess at the solution using a
Read Only Memory (ROM) look-up table. However, consideriug the speed improvement that
has occurred recently in CMOS VLSI, the speed advantage offered by such a table may not
justify the amount of area it consumes. An alternative approach is to guess an approximate
solution, and rely on the self correcting propertv of the algorithm.
Consider the '(f + 1)'/" iteration that produces the estima,te of the inverse 'qi+r', for which
theerroris'e;11'. If thecorrectvalueof theinverseof 'ó'is'Q',thentheerrorafterthe(¿+t)tÀ
iteration can be obtained using Equation 4.28.
Q¡+'¡ 
= 
fi{f iî-:''"" -"'-u'7 (Qxb=t) (4.2e)
Therefore, the precision of each iteration is the square of the precision of the iteration before it.
The important thing to note is that Equation 4.29 always gives the same answer irrespective
of the sign of ε, and that the bε² term is prefaced by a minus sign. Therefore, even after an in-
correct initial guess that estimates the inverse above the true inverse, successive approximations
will approach the true inverse from below.
This important fact can be used to find a 'good' starting point for Newton's iteration. One
possible starting point for the iteration is the guess q₀ = 0.5, as we know that all solutions will
be above this point, and we are trying to approach the solution from below. This approach has
the advantage of simplicity. We know that the first (fractional) bit must be one, so it is easy just
to set it and let the self-correction of Newton's method look after the relatively poor guess.
Another approach is to recognize that the larger the divisor, the smaller the quotient. Thus,
an estimation scheme that returns the maximum possible quotient for the minimum divisor,
and vice versa, may have some merit. Such a scheme is proposed in the text by Burgess [11],
the choice of starting estimate being:

    q₀ = 1 − f        if 1 ≤ b < 1.5
    q₀ = 1 − f/2      otherwise        (4.30)

where 'f' is the fraction part of b.
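In software form this starting estimate reads (a sketch; the function name is hypothetical):

    def q0_burgess(b):
        """Starting estimate of Equation 4.30 for a normalised divisor 1 <= b < 2."""
        f = b - 1.0                        # fraction part of b
        return 1.0 - f if b < 1.5 else 1.0 - f / 2.0

    for b in (1.25, 1.75):
        print(b, 1.0 / b - q0_burgess(b))  # initial error e0 = Q - q0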
The error associated with each can be easily calculated, as shown below in Equations 4.31
& 4.32. Here, 'b' is the number to be inverted, q₀ is the initial estimate and Q is the actual
inverse. If the initial estimate is chosen to always be q₀ = 0.5, then the error is a linear function
dependent on the actual quotient, ie

    ε₀ = Q − q₀ = Q − 0.5        (4.31)
Alternatively, if the initial estimate is chosen to be as defined in Equation 4.30, then the
error is

    ε₀ = Q − (1 − f)        if 1 ≤ b < 1.5
    ε₀ = Q − (1 − f/2)      otherwise        (4.32)
Comparing the errors between the two starting schemes graphically leads to the plots shown
either separately in Figure 4.23, or together in Figure 4.24. Note that only the magnitudes of the
errors are plotted.

Figure 4.23: Error graphs for a) 0.5 < Q < 2/3 b) 2/3 < Q < 1
As can be seen in these figures, the difference between the schemes is quite large. This is
apparent in Figure 4.25.
Although the initial estimate of q₀ = 0.5 is very simple, using the other starting estimate
method (Equation 4.30) is not much more complicated, and produces a much better estimate.
The error graphs shown in Figure 4.24 also indicate that the complexity of Equation 4.30 can
be reduced with a minor estimate accuracy penalty by noting that the choice q₀ = 1 − f/2 has
error characteristics over the range 1.0 ≤ b < 1.5 (1 > Q > 2/3) that are very similar to the choice
q₀ = 1 − f over the same range. Recalling that the error in the initial estimate is

    ε₀ = Q − (1 − f)        if 1 ≤ b < 1.5
    ε₀ = Q − (1 − f/2)      otherwise
the magnitude of the error can be integrated over the range of possible quotients, giving

    ∫ from 2/3 to 1 of (Q − 2 + 1/Q) dQ = 0.0166            (q₀ = 1 − f branch)
    |∫ from 0.5 to 2/3 of (Q − 1.5 + 1/(2Q)) dQ| = 0.00894  (q₀ = 1 − f/2 branch)

Figure 4.24: Error magnitudes for both starting schemes over 0.5 ≤ Q ≤ 1

Figure 4.25: Difference between the error magnitudes of the two starting schemes
If the magnitudes of the error integral terms are added, then the total error integral using
Equation 4.30 directly will be 0.0166 + 0.00894 = 0.0255. If the modified version of Equation 4.30
is used such that q₀ always equals 1 − f/2, then the total error integral will be 0.0195 + 0.00894 =
0.0284. Therefore, using the simplified starting approximation of always assuming q₀ = 1 − f/2
will result in only an 11% increase in the starting error. Just for comparison's sake, if the initial
guess is chosen as q₀ = 0.5, then the total error integral is

    ∫ from 0.5 to 1 of (Q − 0.5) dQ = 0.125

which is more than four times larger than using q₀ = 1 − f/2. Thus a good compromise in
minimising error and reducing complexity would be to use the estimate q₀ = 1 − f/2 always.
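These integrals are easy to cross-check numerically; the sketch below (a software check
only, with illustrative names) reproduces the figures quoted above:

    def integrate(err, lo, hi, steps=100000):
        # midpoint-rule integration of the error over the quotient range
        h = (hi - lo) / steps
        return sum(err(lo + (k + 0.5) * h) for k in range(steps)) * h

    e_1f  = lambda Q: Q - (2.0 - 1.0 / Q)    # error of q0 = 1 - f   (b = 1/Q)
    e_1f2 = lambda Q: Q - (1.5 - 0.5 / Q)    # error of q0 = 1 - f/2
    e_05  = lambda Q: Q - 0.5                # error of q0 = 0.5

    print(abs(integrate(e_1f, 2/3, 1.0)))    # ~0.0166
    print(abs(integrate(e_1f2, 0.5, 2/3)))   # ~0.00894
    print(abs(integrate(e_1f2, 2/3, 1.0)))   # ~0.0195
    print(abs(integrate(e_05, 0.5, 1.0)))    # 0.125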
Convergence
The convergence of an iterative method of root finding is the rate at which the approximation
converges to the actual solution, if it does at all. As the rate of reduction of error is related
to the rate of convergence, Equation 4.29, which was used to determine the error between
any estimate and the actual solution, can also be used to determine the rate of convergence and
also whether convergence will actually occur. Equation 4.29 is reproduced below for convenience.

    q_{i+1} = Q − b ε_i²

Also recall that the estimate q_i is different from the exact answer by an amount ε_i such that

    ε_i = Q − q_i        (4.36)

where ε_i will in general be positive, except for the cases of either i = 0 (q_i is an initial estimate)
or a numerical error has occurred. It is not a requirement that ε_i be positive.
Substituting Equation 4.36 into Equation 4.29 gives the error recurrence

    ε_{i+1} = Q − q_{i+1} = b ε_i²
Therefore, the error reduces quadratically if the estimate is sufficiently close to the actual answer.
If the error reduces quadratically, then the precision increases quadratically at each iteration.
Hence, if the initial estimate is correct to two places of precision, the first correction will be
correct to four places, the next will be correct to eight, etc. If 32 places of precision were required,
then four iterations would be required, while 64 places of precision would require five iterations.
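A quick numerical demonstration of this quadratic behaviour (a software check, not part of
the original design):

    # the error obeys e_{i+1} = b * e_i**2 (Equation 4.29), so the number of
    # correct bits doubles every iteration
    b = 1.375
    Q = 1.0 / b
    q = 0.5
    for i in range(5):
        e = Q - q
        print(i, e, b * e * e)       # the next error equals b*e^2 of the current one
        q = q * (2.0 - q * b)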
4.2.2 Byte Divider
This is, in effect, division using a higher radix. The term 'byte divider' refers to the fact that past
processors by Amdahl and Honeywell used an eight bit (one byte) radix [92]. The algorithm
for division presented here is based on the formulae given by Flynn & Waser [92], although
subsequent to the design of the divider, it was noted that Briggs and Matula presented a similar
design at the Eleventh Symposium on Computer Architecture [10].
Consider Equation 4.28, reproduced below for convenience,

    q_{i+1} = q_i(2 − q_i b)
The impetus for the higher radix division comes when it is considered what the precision of
the intermediate, or early iteration, results will be. If the first iteration has one bit of precision,
the multiplication b × q₁ is an N × 1 product, where N is the precision of b. In the second
iteration, an N × 2 multiplier will be required, in the third an N × 4 multiplier, and so on, until
an N × N multiplier is required. However, as the solution is approached from below, once the
first 'P' bits of precision have been obtained, there is no need to recalculate them.
Once the first byte has been obtained, call it d₀, it can effectively be 'retired' from the
calculation, as it will not be modified again, with the exception of a possible carry-propagate in.
Similarly, when the second byte, d₁, has been obtained, it will remain unchanged except for a
possible carry-propagate in. Therefore, if we rearrange Newton's equation for division such that
we calculate the difference between successive iterations, then we can arrange that the difference
is, in fact, one byte long. Rearranging the formula is not at all difficult, as it is simply a matter
of shifting a q_i from the right hand side to the left, ie

    d_{i+1} = q_{i+1} − q_i = q_i(2 − b q_i) − q_i = q_i(1 − b q_i)        (4.37)
so that

    q_i = q_{i−1} + d_i = (q_{i−2} + d_{i−1}) + d_i = d₀ + d₁ + ... + d_i        (4.38)
Now consider the difference term on the right hand side of Equation 4.37. Substituting in
the result of Equation 4.38 gives

    1 − b q_i = 1 − b(d₀ + d₁ + ... + d_i)

Defining z_i = 1 − b q_i, the difference term, then

    z_i = 1 − b q_i = 1 − b(d₀ + d₁ + ... + d_{i−1}) − b d_i = z_{i−1} − b d_i        (4.39)
Equation 4.37 now becomes

    d_{i+1} = q_i z_i = q_i(z_{i−1} − b d_i)        (4.40)

which is an n × 8 product for an n-bit number 'b', followed by a sum (z_{i−1} − b d_i), which is stored
for the next iteration, and finished by a product. As only the leading digit (byte) of z_i is used
for the final product, the final product is actually an 8 × 8 multiply. Once the difference
d_{i+1} is obtained, the approximation to the quotient q_{i+1} can be formed. Note that, although
the quotient is the sum of the estimates that are calculated at each iteration, the calculated
d_i's are largely disjoint, and so any previous estimate of the quotient will only be affected if a
'carry-in' exists. Therefore, if the summing is achieved using signed digit addition, which has
a maximum carry path of two digits and a constant addition time, then cycle times remain
constant throughout the whole process and hence can be minimized.
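A software model of the byte-division recurrences (Equations 4.37 to 4.40) is sketched
below. It truncates each difference to one further byte and is an illustrative model only, not
the hardware datapath; the function name is hypothetical:

    def byte_divide(b, bytes_out=4):
        """Approximate 1/b for 1 <= b < 2, retiring one byte per iteration."""
        q = 1.0 - (b - 1.0) / 2.0          # q0 = 1 - f/2 starting estimate
        for _ in range(2):                 # two Newton-Raphson steps give eight bits
            q = q * (2.0 - q * b)
        q = int(q * 256) / 256.0           # retire the first byte, d0
        z = 1.0 - b * q                    # z_0 = 1 - b q_0   (Equation 4.39)
        for i in range(1, bytes_out):
            d = q * z                      # d_{i+1} = q_i z_i (Equation 4.40)
            scale = float(1 << (8 * (i + 1)))
            d = int(d * scale) / scale     # keep only the next byte of the difference
            q = q + d                      # q_{i+1} = q_i + d_{i+1} (Equation 4.38)
            z = z - b * d                  # z_{i+1} = z_i - b d_{i+1} (Equation 4.39)
        return q

    print(byte_divide(1.625), 1.0 / 1.625)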
Starting Byte Division
The procedure outlined in Section 4.2.2 above assumes an initial starting byte q₀ to use as a
starting point for future iterations. Although this could be obtained from a look-up table, the
table must contain approximately one thousand locations for eight bits of result (not 2⁸ = 256,
as division is a non-linear operation). This sort of table would consume significant area and so
an alternative method is sought.
The obvious answer is to use Newton-Raphson inversion to obtain the first eight bits, and
then swap to byte division for the remaining n − 8 bits. If Newton-Raphson division is used
to form a quotient estimate eight bits long, the n × 8 multiplier used in byte division will be
sufficient. Using the starting estimation schemes described above, two bits of precision are
available initially. Two iterations of Newton's method will produce an eight bit result, requiring
two passes through the n × 8 multiplier and a single sum stage for each iteration. After a single
byte of the quotient has been formed, the algorithm is switched to byte division, which requires
two passes through the n × 8 multiplier, and two additions.
Why Byte Division?
The advantage of using byte division is obvious when it is considered that the size of the
multiplications that are required becomes very large towards the end of a standard Newton-
Raphson type inversion, whereas they are a constant size for byte division. Assuming that an
eight-bit starting estimate is available, a full Newton's iteration will require two further iterations
to reach a 32-bit solution and three iterations to reach a 64-bit solution.
Again recalling Newton's iteration,

    q_{i+1} = q_i(2 − b q_i)

and considering the case of 'b' being n bits of precision, each iteration requires two n × n
multiplications and one n-bit sum. To obtain the time advantage associated with a full New-
ton's iteration implementation, a full n × n multiplier would be required for the final iteration.
However, for earlier iterations the multiplier would be under-utilised, performing n × 2, n × 4,
etc. multiplications on a full n × n array. Therefore, the area penalty of providing a full n × n
multiplier becomes significant.
Not only is the area required for the multiplier reduced by a factor of four for 32-bit division
and by eight for 64-bit division when using byte division, in general the time required for a
32 × 8 or 64 × 8 multiplication is typically three fifths or one half the time required for a full
square multiplication respectively¹⁰. Therefore, each iteration using byte division will generally
run faster, although more iterations will be required.
Assuming that a 32-bit quotient is required, that a 32 × 8 multiplier array takes 3/5 the
time to complete a product compared with a 32 × 32 array, that a full multiplier array cannot
supply a solution except at the final level of adders, and that an adder takes approximately
the same time as a single level of a multiplier array, then a quantitative comparison of byte
division against Newton's iteration can be made. The comparison is best made in terms of
signed digit adder cell delays. Therefore, a single sum operation takes one adder delay, a 32 × 8
product takes three adder delays, and a 32 × 32 product takes five adder delays.
The first two bits of the quotient are available initially. Therefore, two iterations of Newton's
method are required to produce an eight bit result. The byte division hardware requires the
following time to produce eight bits of quotient:

    T_{ByteDiv,8} = 2 × (T_add + 2 T_mult) = 14 T_adder delays        (4.41)

whereas the Newton's iteration requires the same number of operations, but through the larger,
slower multiplier:

    T_{NR,8} = 2 × (T_add + 2 T_mult) = 22 T_adder delays        (4.42)
To produce a 32-bit quotient, three iterations of byte division are required, using time:

    T_{ByteDiv,32End} = 3 × (2 T_add + 2 T_mult) = 24 T_adder delays        (4.43)

for a total of 38 adder delays. Using Newton's iteration to produce the final 32-bit quotient,
only two iterations are required, using time:

    T_{NR,End} = 2 × (T_add + 2 T_mult) = 22 T_adder delays        (4.44)
¹⁰Using a signed digit multiplier array, a 32 × 8 array will use three levels of adders as opposed to five for a full
32 × 32 array.
Therefore, the time to produce a 32-bit result will be 44 adder delays, which is actually slightly
slower than byte division.
If the quotient is a 64-bit number, then seven iterations of byte division are required after
the initial eight bit result to produce the final answer. This will require time

    T_{ByteDiv,64End} = 7 × (2 T_add + 2 T_mult) = 56 T_adder delays        (4.45)

for a total time of 70 adder delays.
Using Newton's method for the complete division will require five iterations, for a total time
of:

    T_{NR,64} = 5 × (T_add + 2 T_mult) = 5 × 13 = 65 T_adder delays        (4.46)

Therefore, on 64-bit division, Newton-Raphson division is marginally faster than byte division,
but with a much greater area penalty.
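The delay accounting above can be restated compactly (a software tally of Equations 4.41 to
4.46, with the multiplier delays assumed as in the text):

    T_add = 1
    T_m8, T_m32, T_m64 = 3, 5, 6          # n x 8, 32 x 32 and 64 x 64 product delays

    first_byte = 2 * (T_add + 2 * T_m8)               # Equation 4.41: 14
    bd32 = first_byte + 3 * (2 * T_add + 2 * T_m8)    # + Equation 4.43: 38
    nr32 = 2 * (T_add + 2 * T_m32) + 2 * (T_add + 2 * T_m32)  # Eqs 4.42 + 4.44: 44
    bd64 = first_byte + 7 * (2 * T_add + 2 * T_m8)    # + Equation 4.45: 70
    nr64 = 5 * (T_add + 2 * T_m64)                    # Equation 4.46: 65

    print(bd32, nr32, bd64, nr64)                     # 38 44 70 65 adder delays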
4.3 8 × 8 Multiplier
As a multiplier is required as part of the iterative divider, a preliminary design was undertaken.
In Section 3.4 on Gauss-Jordan elimination, it was pointed out that an inverse is required
every (p + 1) loads, where p is the dimension of the matrix array. However, as the divisor is
the i-th element in the i-th wavefront, there are only (p − i) load cycles (p is the matrix array
dimension) available until the wavefront has been completely loaded. Therefore, to prevent the
array from stalling, the multiplier needs to be relatively fast, and so a multiplier tree should be
used.
A tree structure reduces the multiply time from order O(n) to O(log n) for an n-bit multiply.
For a 32-bit number, the reduction is a factor of 32/5, while the factor becomes 64/6 for double
precision (64-bit). One very common multiplier tree that was investigated is the Wallace tree,
shown in Figure 4.26. Here, CSA denotes a Carry-Save Adder and CLA denotes a Carry-
Lookahead Adder. The Carry-Lookahead Adder is needed to convert the solution from CSA
form to standard binary.
The Wallace tree was dropped in favour of a binary tree using signed-digit arithmetic. The
three reasons for this were
• The signed digit adder cells had already been designed, and these provided the small
propagation delays typical of redundant arithmetic systems.
• Although signed-digit notation requires two bits to represent three numbers {−1, 0, 1}, the
Carry-Save Adders in Figure 4.26 require more hardware than propagation adders such
as the CLA or Carry-Select, as carry bits must be saved at the outputs of each register.
Therefore, the areas of both the Wallace tree and the signed-digit multiplier tree are of
order O(n² log n) [91].
• As the multiplier will be used in an iterative divider, the output will be fed back into the
inputs, after suitable additions. The major disadvantage of signed-digit notation, that of
conversion back to standard binary, can be left to the end of the division process rather
than being performed at every iteration.
Figure 4.26: Wallace-tree Multiplier.

Figure 4.27: Signed Digit Multiplier. (⊙ = digit multiplier cell; the C.L.A. is the redundant-to-binary converter)
The multiplier implemented is the one by Takagi et al. [91]. This is a high speed binary tree
using signed digit arithmetic. A block diagram of the implementation is given in Figure 4.27.
The digit addition cells are single bit adders as described in Section 4.1. They take two
input digits and two carry bits, one generated in the previous cell, the other in the next previous
cell, and produce a single output digit and two carry output bits.
The digit multiplier cells simply perform multiplication at the digit level. As the two multi-
plier inputs are each in the range {−1,0,1}, the output of the multiplier cell is in the same range
({−1,0,1}). In fact, the multiplier can be considered to act on one digit, according to the value
of the other, where that action is:
• set the output to zero
• change the sign but not the magnitude
• do nothing
In table form, this is

    Input2 \ Input1 |  −1    0    1
    ----------------+----------------
         −1         |   1    0   −1
          0         |   0    0    0
          1         |  −1    0    1

The individual magnitude and sign bits combine to produce Karnaugh maps for the output
sign and magnitude bits, whose groupings reduce to

    S_out = S₁ ⊕ S₂
    M_out = M₁ · M₂

ie the output magnitude is the AND of the input magnitudes, and the output sign is the
exclusive-or of the input signs.
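In software, the digit multiplier cell is a two-gate function; the sketch below (an illustrative
model with hypothetical names) checks it exhaustively against ordinary multiplication of the
digit values:

    def digit_mult(sa, ma, sb, mb):
        m_out = ma & mb      # product magnitude: nonzero only if both inputs are
        s_out = sa ^ sb      # product sign: XOR of signs (don't care when m_out = 0)
        return s_out, m_out

    def value(s, m):
        return 0 if m == 0 else (-1 if s else 1)

    for sa in (0, 1):
        for ma in (0, 1):
            for sb in (0, 1):
                for mb in (0, 1):
                    s, m = digit_mult(sa, ma, sb, mb)
                    assert value(s, m) == value(sa, ma) * value(sb, mb)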
An extract of the full multiplier showing the digit multiply cell is shown in Figure 4.28. The
layout shows the 'b' input bus running right to left and the 'a' input being applied directly to
the multiplier cell, although in the complete multiplier the 'a' input will run top to bottom on
a bus. In this way, the VLSI layout of the multiplier is very similar to the schematic in Figure
4.27, except that the bottom half has been inverted to ease connections.
Figure 4.28: VLSI Layout of Digit Multiply Cell
The complete layout of the signed digit multiplier using unlatched adder cells is shown in
Figure 4.29. A box drawn around the multiplier, encompassing the full multiplier top to bottom
and left to right, consumes an approximate area of 0.82mm × 1.58mm = 1.30mm², which includes
unused area. The distance between successive cells in the horizontal direction is 0.1184mm.
Therefore, extending the multiplier array to an 8 × 32 array would consume approximately
0.82mm × (1.58 + 0.1184 × 24)mm = 3.62mm² of silicon in 0.7µm ES2 CMOS. As a signed digit adder
always takes the same time to perform a calculation irrespective of word length, the cycle time
remains the same for an 8 × 32 as for an 8 × 8 multiplier.
Figure 4.29: VLSI Layout of 8 x 8 Non-pipelined Multiplier
The multiplier is too large to test exhaustively, so a few test vectors were applied. For each
test, the 'a' input was incremented from zero to eight and then from 248 to 255, while the 'b'
input was held constant for the complete simulation. The following HSpice plots show the
relevant outputs for b = 1, 64, 141, 255. All outputs that are not shown have a magnitude bit of zero.
Figure 4.30: a) b = 1 b) b = 64

Figure 4.30: ci) b = 141 cii) b = 141
The plots show that the multiply time for an 8 × 8 three level signed digit multiplier is
typically less than ten nanoseconds, with a worst case delay of approximately twelve nanoseconds.
Therefore, allowing fifteen nanoseconds for an 8 × 8 multiply is adequate. The actual output
digits are difficult to read in the plots, so they are included as Tables 4.3 to 4.6. The first two
tables show the non-zero digit values. The third and fourth tables show the output in terms of
the groups of six digits, in an effort to keep the viewed data to a minimum.
4.3.1 Pipelined Multiplier
Although a pipelined multiplier is not required for the data controllers, the multiplier above was
pipelined to demonstrate the ease with which this can be done and the suitability of a signed-
digit approach to pipelining. In addition, due to the heat dissipation problems in Gallium
Figure 4.30: ciii) b = 141
Tables 4.3 to 4.5: Digit Outputs for Unlatched Multiplier (b = 1, 64, 141)
Table 4.6: Digit Outputs for Unlatched Multiplier, b = 255
Figure 4.31: Processing Cell Using Pipelined Multiplier
Arsenide (GaAs) and the fact that the switching speeds of GaAs are not required¹¹, the lower
power and higher density of CMOS is becoming increasingly attractive, and future generations
of the processing array will probably be constructed, at least in part, in silicon CMOS. If that
is the case, a single high speed pipelined multiplier can be used to 'simulate' the presence of
several processing elements, although each element will contain its own accumulator and I/O
port. The likely configuration would be as shown in Figure 4.31.
The signed-digit multiplier is particularly attractive to use as a pipelined multiplier due to
the constant delay through each adder layer of the multiplier, irrespective of the length of the
adder. This means that each stage in the pipeline is perfectly matched, and no stage must wait
for any other stage. The latches can be placed before any adder stage, for a varying number of
pipeline stages. For example, if Booth recoding is used, an 8 × 8 multiplier comprises three
signed digit adders in two levels, so a two stage pipeline can be implemented. However, if no
Booth recoding is used, an 8 × 8 multiplier will contain three levels of signed digit adders, so
a three stage pipeline can be used. Additionally, a two stage pipeline could be used, except
that one stage will have two adders and the other will have one adder, resulting in a pipeline
imbalance. However, if the multiplier array is (say) 64 × 64, there will be six levels of signed
digit adders, so the pipeline could have two, three or six matched pipeline stages. For the sake of
demonstration, the 8 × 8 multiplier was pipelined into three stages, one for each level of adders.
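The balanced stagings follow directly from the level count, as the small sketch below
illustrates (a software aside; the helper name is hypothetical):

    import math

    def stage_options(n):
        # an n x n signed-digit binary tree has ceil(log2(n)) adder levels;
        # any divisor of the level count gives equal-depth pipeline stages
        levels = math.ceil(math.log2(n))
        return levels, [d for d in range(1, levels + 1) if levels % d == 0]

    print(stage_options(8))    # (3, [1, 3]): one or three matched stages
    print(stage_options(64))   # (6, [1, 2, 3, 6]): two, three or six stages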
The 8 × 8 multiplier VLSI design is shown in Figure 4.32. The design uses 0.7µm ES2 CMOS,
and consumes a rectangular area of 0.93mm × 3.2mm = 2.976mm², although this includes some
unused space. The difference between successive cells horizontally is 0.25mm, so that extending
the multiplier to an 8 × 32 array would make it approximately 3.2mm + 24 × 0.25mm = 9.2mm in
length, and it would consume 9.2mm × 0.93mm = 8.56mm² of area.
The multiplier was simulated using HSpice, although again not exhaustively. The input for
the first simulation is shown in Figure 4.33a), which produced the results in Figure 4.33b).
The applied input was the number sequence '0 1 0 3 2 5 4 7 6 9 8 11 10 13 12 15 14 241 240
243 242 245 244 247 246 249 248 251 250 253 252 255 254' for the 'b' input, which was applied
for an 'a' input of '0 & 1'. As can be seen in the output plot (Figure 4.33b)), the result of the
simulation is an output of zero for the first half of the simulation, followed by the same sequence
Figure 4.32: VLSI Layout of Pipelined 8 x 8 Multiplier
Figure 4.33: HSpice Simulation Plots a) Input b) Output
as the input. A new output is available every five nanoseconds, after an initial delay of fifteen
nanoseconds (three cycles). Other simulations were run, and the multiplier was found to operate
correctly.
4.4 Conclusion
In this chapter, propagation-free signed digit arithmetic was introduced and the conversion
process between signed digit and standard binary notations shown. Basic addition cells were
designed, laid out and simulated in the 0.7µm ES2 CMOS process, and found to operate correctly
with addition cycle times of less than five nanoseconds. When the word-length independence
of the cycle time of signed-digit arithmetic is recalled, the cells can be used to implement very
high bandwidth computational systems.
Support arithmetic components were also considered, and a high-radix divider proposed that
provided very favourable speed/area trade-offs, while still providing the performance required
by the system. To complete the design of the divider, a signed-digit multiplier architecture was
designed, laid out and simulated.
As the computation rates of computers have increased, the ability of a processor to generate
operand addresses at a correspondingly high rate has been severely tested. Specialised hardware
has been described in the literature by Marwood and others [69, 72] that directly generates
fundamental addressing patterns to help alleviate this bottleneck.
The majority of these papers, with the exception of those by Marwood, attempt to optimise
the address generation for vector-type operations, leaving matrix operations as a superset of
vector operations. Unfortunately, there are several matrix algorithms that do not map well onto
vector addressing patterns, such as prime factor mappings, etc. Marwood, however, considers
matrix algorithms separately, and has implemented designs that directly support many matrix
structures. Marwood's address generation difference engine is used here, due to its versatility of
function and elegance of design.
In fact, the address generator difference engine is the fastest component of the whole system,
and the speed of the generator is the main limit on system performance. The importance of
memory bandwidth for a systolic array was pointed out by Katona in his lattice model for
cellular algorithms [46, 58], which can be translated into operand address generation bandwidth
if the addresses are generated with a slower cycle time than the memory system. Marwood
applied the results to the evaluation of matrix products on a two dimensional systolic array [58],
a brief summary of which follows.
The lattice model involves the concept of two matrices, A & B, moving toward the processing
array with velocities v_A & v_B respectively, to form the matrix product C within the processing
array. This is shown graphically in Figure 5.1, from [58]. For ease, the processing array is set to
be a square N × N array, whereas Marwood uses the more general rectangular M × N array. He
derives the bandwidth requirements β_A and β_B of each input, and shows that the output bandwidth
(β_C) can be matched to those of the input by setting

    β_C = 2N/τ        (5.2)

where τ is a time step.
By dividing the number of processor operations by the time in which they occur, the com-
putational rate can be determined. Bearing in mind that a multiply and accumulate takes place
in each processing element every time step, the computational rate R of the N × N array is

    R = N²/τ = (β_C τ/2)²/τ = β_C² τ/4        (5.4)
This is a slightly surprising result, as it indicates that not only is the computational power
of the square processing array proportional to the square of the array bandwidth, but also that
it is inversely proportional to the computational speed of the processing elements. The logical
explanation is that the computational rate, R, in Equation 5.4, is determined for the case when
the matrix processor is being supplied with data at its maximum rate. The bandwidth required
for an N × N processing array is the same as for a 2N × 2N array with each element operating at
half the speed. As there are four times the number of systolic elements in the larger array, each
operating at half the speed of the smaller array, the larger array has twice the computational
rate using the same bandwidth.
As Marwood points out, assuming that the problem order is greater than the system hardware
order, there are two possible alternatives for increasing system performance:
1. Increase the bandwidth to the array. This is technologically constrained by the limits on
how fast data can be moved.
2. Increase the execution time of the processing elements to allow a greater number of pro-
cessing elements to operate with the same overall bandwidth. This is not constrained
by technology, as the processing elements can be made arbitrarily slow. Slower PEs often
mean smaller PEs, since serialization of the arithmetic operation can be used to reduce PE
size at the cost of longer execution time. The limit here is due to practical considerations
on the physical size of the array, and the constraints of the problem size (the size of the pro-
cessing array needs to be smaller than the order of the problem size to increase performance
by increasing array size. Ideally, the problem size should be an integral multiple of the
array size).
This is also of significance if Marwood's constant bandwidth model is used [58]. Here, rather
than stalling if the entire array is not being used, the array is composed of variable speed
components along bands. The computational speed is selected depending on array utilisation.
A range of technologies can be used, with the faster 'edge' components using more expensive
technology (eg. Gallium Arsenide) and the components getting progressively slower further from
the array. Thus, although an N × N array will be N times faster than a single computational
element running at the same speed, if a scalar calculation is required, the single computational
element will be N times faster than the array.
5.0.1 Matrix Addressing
The original system proposed by Marwood [56] describes the mapping algorithm for an ar-
bitrary dimensioned mapping, although the implementation shown is for a two dimensional
mapping. This is shown in Figure 5.2, below. This work was extended [65] to include up
to four dimensions, which will allow either a four dimensional structure to be mapped on to
a one dimensional addressing space, or a two dimensional, arbitrary-sized structure to be
partitioned onto a fixed-size two dimensional array.

Figure 5.2: Marwood's Two-Dimensional Difference Engine
Marwood's example mapping is the Alternative Integer Mapping, although he points out
that the Chinese Remainder Theorem and other simple mappings can be implemented on the
proposed architecture.
The Alternative Integer Representation is

    n = ⟨n₁N₂ + n₂N₁⟩_N        (5.5)

where N₁ and N₂ are mutually prime and N₁N₂ = N, and ⟨a⟩_N means 'a modulo N'.
This is very similar to the conventional mapping used to store a two or more dimensional
matrix in a linear memory address space, ie

    n = n₁N₂ + n₂        (5.6)

The Chinese remainder theorem is slightly more complicated, although still implementable
on the proposed hardware.

    n = ⟨n₁N₂⟨N₂⁻¹⟩_{N₁} + n₂N₁⟨N₁⁻¹⟩_{N₂}⟩_N        (5.7)

If appropriate constants are chosen and a modulo capability is available, Equations 5.5 to 5.7
can be implemented on a two dimensional difference engine. Additionally, an offset base address
can be included if added after the difference engine. This leads to the following expression:

    n = base_address + ⟨n₁Δ₁ + n₂Δ₂⟩_q        (5.8)
It would be desirable to remove the need for the divider that is generally used to perform the
modulo calculation. This can be done by noting that at each address calculation, a difference,
Δᵢ, is added to the previous address, 'prev_addr', which is already 'modulo q'. Therefore, if
the difference between 'q' and the previous address is greater than Δᵢ, the sum of the previous
address and Δᵢ is correct modulo q. Otherwise, it is required that q be subtracted from the
result of the sum. This can be summarised as:

    next_addr = prev_addr + Δᵢ        if q − prev_addr − Δᵢ > 0
    next_addr = prev_addr + Δᵢ − q    otherwise        (5.9)

This is implemented on Marwood's difference engine serially, with the initial calculation
being prev_addr + Δᵢ followed by (prev_addr + Δᵢ) − q. The sign of the second calculation is
then used as the control bit for the multiplexers.
5.0.2 Parallelizing and Expanding the Difference Engine
As Equation 5.9 is a loop required for each iteration containing the previous address, it would
be difficult to pipeline this stage. However, parallelizing is a relatively simple matter. Rather
than calculating 'prev_addr + Δᵢ', and then subtracting the modulo 'q' from this, if Δᵢ − q is also
supplied to the address generator, the sum of 'prev_addr + (Δᵢ − q)' can be calculated in parallel
with the original sum (of prev_addr + Δᵢ). The required result can then be switched using a
multiplexer, depending on the sign of the result of the sum prev_addr + (Δᵢ − q). Figure 5.3
shows a parallel implementation of the difference engine, and a software sketch of the same
update follows below. As is obvious from this figure, the parallel version of the difference engine
requires an extra (dimensions − 1) registers over the serial version, but no extra adders.
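The sketch below models the parallel update of Equation 5.9 (function and variable names
are illustrative only):

    def next_addr(prev_addr, delta, delta_minus_q):
        a = prev_addr + delta            # Adder1: prev_addr + delta
        b = prev_addr + delta_minus_q    # Adder2: prev_addr + (delta - q), in parallel
        return a if b < 0 else b         # sign of Adder2 output drives the multiplexer

    # unit-stride walk through a q = 12 address space, wrapping without division
    addr, q = 0, 12
    for _ in range(14):
        print(addr, end=' ')             # 0 1 2 ... 11 0 1
        addr = next_addr(addr, 1, 1 - q)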
Expansion of the number of dimensions from two to four is also a simple matter. This can
be achieved by adding four additional registers and four additional multiplexers.
Figure 5.3: Four-Dimensional Address Generator
5.0.3 Example Mappings on the Difference Engine
Marwood provides several example address mappings implemented on his address generator,
which can also be implemented on the expanded, four dimensional address generator [56, 58].
The required input data for a selected mapping and the corresponding output sequence of
addresses is presented in Table 5.1.
These same mappings can be calculated on the expanded address generator, as presented in
Table 5.2. Note that these do not use the third and fourth dimensions of the array, as only the
equivalent mappings of the two dimensional version are presented.
5.1 Address Generator Components
5.1.1 Sign Detector
One of the main disadvantages of signed digit arithmetic is that the sign of an integer is not
readily apparent, as it is with two's complement arithmetic. The sign of a signed digit number
is the sign of the most significant non-zero digit, which may be at any of the 32 positions for a
32-bit number.
If each bit were checked in turn, an order O(n) operation, the advantage of the inherent
parallelism of signed digit arithmetic would be lost. An order O(log n) sign detect unit can
be constructed using a binary tree, with each cell transmitting the sign of the larger of two
adjacent digits to the layer beneath. The cells themselves consist of three multiplexers, which
select whether the more or less significant digit will be propagated to the next layer. The select
signal for the multiplexers is the magnitude bit of the more significant digit. A block diagram
is shown in Figure 5.4.
The equations governing the operation of the cell are:

    S_out = M₁·S₁ + M₁'·S₀        (5.10)
    M_out = M₁ + M₁'·M₀        (5.11)
    M_out' = M₁'·M₀'        (5.12)

The least significant edge cell can be reduced in size, as the outputs M_out and M_out' are not
required.
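A behavioural model of the tree (a software illustration assuming a power-of-two digit
count; names are hypothetical):

    def sign_detect(digits):
        """digits: list of (S, M) pairs, most significant first."""
        while len(digits) > 1:
            # each cell forwards the more significant digit if its magnitude
            # bit is set, otherwise the less significant one (Eqs 5.10 to 5.12)
            digits = [hi if hi[1] else lo
                      for hi, lo in zip(digits[0::2], digits[1::2])]
        s, m = digits[0]
        return s if m else 0     # sign bit of the most significant nonzero digit

    # the leading -1 dominates the +1 below it, so the number is negative
    print(sign_detect([(1, 1), (0, 0), (0, 1), (0, 0)]))   # 1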
Table 5.1: Example Mappings on the Two-Dimensional Difference Engine

Table 5.2: Example Mappings on the Expanded Four-Dimensional Address Generator

Figure 5.4: Sign Detector Cell

Figure 5.5: Sign Detect Unit
5.1.2 Initialisation
The difference engine must always start with the first output address being zero. Any offset
in this address is included by means of the base address register and adder. In Figure 5.2,
initialisation is achieved by adding the value in register D1 to the negative of itself, with the
result placed in the 'Address Out' register. As the result of a number added to the negative
of itself is always zero, the Address Out register is effectively initialized to zero. While this
achieves the desired result, there will be a minor speed penalty due to the extra multiplexer in
the critical path (Mux1 in Figure 5.2). As the use of redundant coded binary is suggested for
the adders, the delay through a multiplexer may account for approximately ten percent of the
cycle time.
An alternative initialisation option is available depending on the technology used. In the
suggested CMOS design, domino logic was used in the final stage of the calculation of the 'm'
output of the adders. The domino logic multiplexer used is shown in Figure 5.6, below.
As can be seen from the figure, the input to the inverter is held high until the signal
'phi' is brought high. 'Phi' is an evaluate signal that is clocked every cycle during operation.
However, 'phi' can also be combined with a 'run' signal, so that 'phi' follows a clock during
normal operation, or else is held low for initialisation or if a stall occurs. The combination is a
simple AND gate, ie

    phi = run.clk1

where clk1 is a clock signal. Therefore, if the Address Out register is latched while 'phi' is low, ie
not in evaluate phase, the data at the input of the Address Out register will be zero. Therefore,
all that is required for initialisation is that the Address Out register(s) receives the latch signal
before the system starts running.
5.1.3 Dimension Counter
Within the address generator framework, there is a need for as many counters as there are
dimensions. At least one of these counters must be very fast (the one of lowest dimension), as
there will only be a single cycle to calculate which 'delta' to use and switch it in. Therefore, the
speed of signed digit arithmetic is again valuable.

Figure 5.6: Domino Logic Exclusive-OR Gate
Decrement Cell
Rather than implementing a full adder/latch cell, use can be made of the fact that one input
to the decrementer is already known to be zero in all but the least significant cell. In fact, if
the negative carry input is used to apply the '−1' to the decrementer, all the cells can be made
identical, with only one input and the appropriate carry lines.
Let the inputs to the cell be labelled {S_in, M_in}, and the carry input be Neg_In. The
requirements of signed digit arithmetic are such that the Neg_Out signal must be independent
of the Neg_In input signal. This is the only restriction on the cell logic. The resulting cell
behaviour is described in Table 5.3. The figure 'X' denotes a 'don't care' in the table, and 'A'
indicates a 'pseudo don't care', in which the table entry in which it is placed can be either a '0'
or a '1', but the choice will affect other table entries.
    S_in  M_in  Neg_In | Result | S_out  M_out  Neg_Out
     X     0     0     |   0    |  A      0      0
     X     0     1     |  −1    |  1      1      0
     0     1     0     |  +1    |  0      1      0
     0     1     1     |   0    |  A      0      0
     1     1     0     |  −1    |  0      1      1
     1     1     1     |  −2    |  A      0      1

Table 5.3: Decrement Cell Combinations

The position {1,1,1} (corresponding to {S_in, M_in, Neg_In}) requires that Neg_Out be set for
this combination of inputs. As Neg_Out can only depend on {S_in, M_in}, position {1,1,0}
must also set Neg_Out. This is the only combination of S_in and M_in for which Neg_Out is set.
Table 5.4: Karnaugh maps for a) S_out & M_out b) Neg_Out
From these Karnaugh maps, the obvious groupings are:
• S_out = M_in'
• M_out = Neg_In ⊕ M_in
• Neg_Out = S_in·M_in
A further simplification can be made if it is noted that the exclusive-nor operator is just as
easy to implement as the exclusive-or, and that, as an and operator is formed using a nand gate
and an inverter, the nand operator is simpler than the and operator. Therefore, if Neg_Out'
is propagated between cells instead of Neg_Out, then the gate count and propagation delay can
be reduced by one. The new equations are:
• S_out = M_in'
• M_out = Neg_In' ⊕' M_in   (an exclusive-nor)
• Neg_Out' = (S_in·M_in)'   (a NAND)
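A behavioural model of a chain of these cells (a software illustration; the helper names are
hypothetical):

    def decrement(digits):
        """digits: list of (S, M) pairs, least significant first; computes x - 1."""
        neg = 1                          # the '-1' enters through the carry input
        out = []
        for s, m in digits:
            neg_out = s & m              # Neg_Out = S_in.M_in (digit is -1)
            m_out = neg ^ m              # M_out = Neg_In XOR M_in
            s_out = 1 - m                # S_out = M_in' (sign of any nonzero output)
            out.append((s_out, m_out))
            neg = neg_out                # borrow moves at most one digit per cell
        return out                       # a final borrow would indicate underflow

    def value(digits):
        return sum((-1 if s else 1) * m * 2**i for i, (s, m) in enumerate(digits))

    five = [(0, 1), (0, 0), (0, 1)]      # 1 + 4 = 5, least significant digit first
    print(value(decrement(five)))        # 4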
The flow diagram representing the decrementer is shown in Figure 5.7.
The decrement cell is shown in Figure 5.8. This cell includes the iteration loop latches, the
input multiplexers and the arithmetic logic.
To enable enough time to achieve the zero detection and the appropriate switching of the
'delta' multiplexers, the value stored in the 'zeroth' counter can be decremented before it is stored
(run once through the decrementer). This requires an extra pair of multiplexers and an extra
pair of latches. The extra latches are required to store the 'new' decremented starting value,
while the multiplexers select whether the original value or the decremented value is loaded into
the new latches. These multiplexers are only used at initialisation. The only other consideration
for constructing a full digit cell is the output. Rather than routing wires from the iteration loop
latches, it was decided that another pair of latches at the output would consume minimal area
and save routing. Therefore, these latches have been included in the full digit cell, although
they could just as easily be left out. The layout of the full digit cell is shown in Figure 5.9.












Figure 5.7: Flow Diagram for Decrement Cell
Figure 5.8: Decrement Cell
Figure 5.9: Full Decrement Cell
(RE)Loading the Decrement Counters
The decrementers must be loaded initially with the correct dimension, and then reloaded with the
original dimension minus one every time they reach zero, until all decrementers have counted
down to zero. To control this, a reset signal is used. The only input control signal to the
decrementer is the 'load' signal, with the sequencing controlled internally and also using the
'latch' signal. When the 'zero' signal is asserted, a 'reset' signal is asserted, and the starting
value (original value minus one) is reloaded into the counters.
Zero Detect Cell
One of the recluirements of the redundant set proposed by Avizienis [3] was that there is a
unique representation for the number zero. Although the signed magnitude notation described
here does not have a strictly unique representation for zero, thete is a unique combination for
the magnitude bits (they are all zero), and it is only the sign bits that change for various zero
representations. Thus, to determine if a signed digit numbet is zero it is necessary to determine
if each and every magnitude bit is zero2
The approach taken to detect if the decrementer value was zero was a simple NAND-NOR
tree structure, which is a logical eqrivalent to an OR-OR structure for zero detection. For
a 32-bit zero detection circuit, Iog232: 5 levels of binary zero detection are needed, which
corresponds to three NOR-gate ievels and two NAND-gate ievels. Therefore ) a zero at the input
will be indicated by a one at the output.
The VLSI structure of the zero detection cell is shown in Figure 5.10. This is a four input
cell, constructetl of two NOR gates and one NAND gate. The pyramidical structure is cJearly
shown, although in practice, a large zero cletection cell can be compressed into a single linear
array of NAND and NOR gates. This is shown later for the case of a sixteen bit decrementer.
2This is the same a-s for a standard binary representation.
Figure 5.10: Four Input Zero Detection Cell
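The alternating NOR/NAND levels are logically an OR tree; a software model of the 32-bit
case (an illustration only, assuming a power-of-two bit count):

    def is_zero(mag_bits):
        level = [1 - (a | b) for a, b in zip(mag_bits[0::2], mag_bits[1::2])]    # NOR
        while len(level) > 1:
            level = [1 - (a & b) for a, b in zip(level[0::2], level[1::2])]      # NAND
            if len(level) > 1:
                level = [1 - (a | b) for a, b in zip(level[0::2], level[1::2])]  # NOR
        return level[0]            # 1 indicates every magnitude bit is zero

    print(is_zero([0] * 32))           # 1: the signed digit number is zero
    print(is_zero([0] * 31 + [1]))     # 0: a nonzero magnitude bit exists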
5.1.4 Complete Decrement Units
To test the design of the decrement cells, a four bit and a sixteen bit decrementer were designed
and simulated. The four bit decrementer can count down from a maximum of fifteen and the
sixteen bit decrementer can count down from 65535. The VLSI layout of the four bit decrementer
is shown in Figure 5.11. Without compaction, the four bit decrementer consumes an area of
0.277mm × 0.246mm = 0.068mm².
The testing of the decrementer is very simple, as it only needs to be tested for the case when
the largest possible input is applied, as the decrementer will pass through all other states. The
output from HSpice is shown in Figure 5.12. It can be seen from this that the decrementer
initially decrements the input from fifteen to fourteen, and then proceeds to decrement the
output every clock pulse until the output is zero, at a clock rate of 200MHz. When the zero flag
is asserted, the decrementer is reloaded with the stored value, in this case fourteen. However,
the decrementer has another cycle before the stored value is loaded (the cycle in which zero is
asserted), so the output actually goes to minus one. The next output is the stored value minus
one, in this case thirteen, and then the cycle repeats.
The sixteen bit decrementer is shown in Figure 5.13. The four levels of zero detection
are compressed into a single layer, on the bottom of the cell. The complete cell consumes
0.245mm × 0.886mm = 0.217mm² in 0.7µm ES2 CMOS. Most of the area is consumed by
initialisation and reset hardware, with the actual decrement cell area consuming approximately
one third of the total.
Full testing of the sixteen bit decrementer is not possible, as a decrement from 65535 at
five nanoseconds per count would require a simulation time of 65535 × 5 × 10⁻⁹ = 327.68µs,
which, apart from the huge amount of time required for such a simulation, would consume more
storage than we have available. However, a decrement from fifteen was run, which shows the
reset and reload functions, and also the zero detection. As signed digit arithmetic is word length
independent, adding more significant digits will not affect the speed at which the decrementer
Figure 5.11: Four-Bit Decrement Unit
Figure 5.12: Simulation Results of Four Bit Decrement Cell
Figure 5.13: Sixteen-Bit Decrement Unit
operates. These results are shown in Figure 5.14. All output magnitudes not shown are zero.
As can be seen from this figure, the decrementer functions correctly.
5.1.5 Multiplexer Selection
The multiplexers that are used to choose between the 'deltas' must be selected in a hierarchical
manner. The delta that is chosen is the one with the smallest index for which the corresponding
counter output is not zero. Therefore, if 'SelA' and 'SelB' are the two multiplexer controls as
defined in the Address Generator Section (Section 5.2), zero0 . . . zero3 are the outputs of the
four zero detectors, and mux0, mux1, mux2 are temporary variables denoting
• select delta 0/1
• select delta 2/3
• select delta (0,1)/(2,3)
respectively, then the multiplexer selection equations become
• mux2 = zero0 · zero1
• mux1 = zero2
• mux0 = zero0
The schematic of the multiplexers showing both sets of select signals (mux0 . . . mux2 and SelA
& SelB) is given in Figure 5.15.
Figure 5.14: Simulation Results of Sixteen Bit Decrement Unit
To convert the separate multiplexer control signals to the previously defined SelA and SelB,
it is obvious that SelB is simply equivalent to mux2. SelA is a combination of all the mux
variables, and can be written as

    SelA = mux2'·mux0 + mux2·mux1
         = (zero0·zero1)'·zero0 + zero0·zero1·zero2
         = (zero0' + zero1')·zero0 + zero0·zero1·zero2
         = zero0·zero1' + zero0·zero1·zero2        (5.13)
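In software form, the hierarchical selection reads (an illustrative model; the names are
hypothetical):

    def select_delta(zero):                 # zero[i] = 1 when counter i is at zero
        mux0, mux1 = zero[0], zero[2]
        mux2 = zero[0] & zero[1]
        sel_b = mux2
        sel_a = ((1 - mux2) & mux0) | (mux2 & mux1)   # Equation 5.13
        return sel_b, sel_a                 # (SelB, SelA) select delta0..delta3

    for z in ([0,0,0,0], [1,0,0,0], [1,1,0,0], [1,1,1,0]):
        print(z, select_delta(z))           # delta0, delta1, delta2, delta3 in turn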
5.2 Implementation
The Address Generator was implemented in 0.7µm ES2 CMOS, using two signed digit adders
with multiplexers, one sign detection unit, four dimension counters and four difference-register
pairs. As only initial design investigation and verification was undertaken, an eight bit address
generator was implemented, although extension to 32 bits is a simple matter. The propagation
free properties of signed-digit adders mean that a 32-bit adder takes the same computation time
as an eight bit adder, and the sign detection unit will require another two multiplexers. As a
single adder and the sign detection unit comprise the critical path, it is felt that the extension
to 32-bits will require of the order of a single nanosecond extra per cycle.
The VLSI layout is of a similar format to that of the schematic in Figure 5.3. The signed
digit adders with multiplexers are laid out one above the other, with the adder with the 'delta'
offsets, Adder1, on top and the adder with the 'delta − q' offsets, Adder2, on the bottom. The
sign detection unit is placed below the bottom adder, Adder2. The outputs from both the adder
units are directed upwards towards the output latches, after which they can be multiplexed to
select the correct output. The latches that contain the differences for the address generation
are on the top right of each adder, and are loaded via an input bus that runs from left to right
above the difference latches. The latches feed the adders via three 2-1 multiplexers configured
as 4-1 multiplexers. A series of control buses run between the difference latches and the adders.
These are fed with control signals generated by the dimension counters and the sign detect unit.
The VLSI layout of an eight bit address generator is shown in Figure 5.16.
Figure 5.16: VLSI Layout of an Eight-bit Address Generator
The eight bit address generator consumes approximately 0.7956mm × 2.11mm = 1.68mm²
of silicon when implemented in 0.7µm ES2 CMOS, and uses 4383 transistors. The separation
between cells is approximately 0.2569mm, so a 32-bit implementation would consume approxi-
mately 1.68mm² + 24 × 0.2569mm × 0.7956mm = 6.59mm² of silicon. There are 491 transistors
in each digit cell plus 24 for each output latch set, so the 32-bit implementation would contain
approximately (491 + 24) × 24 + 4383 = 16743 transistors.
The eight bit address generator was simulated for speeds up to 100 MHz, using HSpice. The
results of the simulations are shown in Figures 5.17 to 5.19 for a variety of input data. The
input data used demonstrate selected cases of the examples presented in Section 5.0.3 and Table
5.2.
Figures 5.17a) to d) show the case of accessing a matrix in normal form, using the corresponding
input data from Table 5.2.
The tabulated results are shown in Table 5.5. This provides the 'D' output from Adder1 and
the 'Q' output from Adder2, as well as the 'Neg' output from the sign detector unit.








These simulation plots have been tabulated in Table 5.6, showing the output 'D' data from
Adder1, the output 'Q' data from Adder2 and the Sign Detector output 'Neg'.








These simulation plots have been tabulated in Table 5.7, showing the output 'D' data from
Adder1, the output 'Q' data from Adder2 and the Sign Detector output 'Neg'. Note that the output
address is taken from the 'D' digits if 'Neg' is high, and from the 'Q' digits if 'Neg' is low.
Other mappings such as submatrix extraction and circulant addressing were simulated, and
found to operate correctly. It is felt that, as the arithmetic is propagation free, the address
generator would operate correctly when extended to 32 bits, especially with the move to the
0.5µm CMOS process that will soon be available to the Department.
Figure 5.17: a) D0 to D3 b) D4 to D7
Figure 5.17: c) Q0 to Q3 d) Q4 to Q7 & Neg
Table 5.5: Resultant Addresses for Normal Addressing Data
Figure 5.18: a) D0 to D3 b) D4 to D7
Figure 5.18: c) Q0 to Q3 d) Q4 to Q7 and Neg Out
Table 5.6: Result of Simulation of Addressing for Transposed Access
Figure 5.19: a) D0 to D3 b) D4 to D7
Figure 5.19: c) Q0 to Q3 d) Q4 to Q7 and Neg Out
5.3 After Calculating the Offset
Once the offset has been calculated, it must be converted to an unsigned binary number and
have the base address added to it. As the address stream is directed at fetching data and there
are no instruction fetches during the computation of a matrix product, there will be no non-
deterministic branches or jumps in the address stream. Therefore, the conversion to unsigned
binary and the addition of a base address can be heavily pipelined operations. However, if a
signed digit adder is used to add the base offset, this will only take a single cycle, and the
conversion to binary form can take place after the base offset has been included.
One more addition to Marwood's difference engine that is of use is an increase in the number
of base offsets that can be selected to be added. The reason for this is the direct solution
of Gauss-Jordan elimination. One base register will hold the next pivot block starting (base)
address, one will hold the row normalisation starting address, one will hold the update phase
starting address and the fourth will hold the starting address of the next phase. Each starting
(base) address is actually calculated in the previous phase, so no storage or extra calculation will
be required, merely a four-way base-select multiplexer and a feedback path from the base address
to the base offset registers.
Figure 5.20: Extended Address Generator
5.4 Conclusion
In this chapter, Marwood's difference engine was extended to four dimensions and implemented
using signed-digit techniques. The inclusion of the third and fourth dimensions allows the differ-
ence engine to directly compute the addresses required for partitioning an arbitrary sized matrix
on a fixed sized array without external intervention. The use of signed-digit arithmetic for this
critical component, together with the new parallel configuration, allows the system to produce
addresses at the rate required by the processing array.
The memory subsystem for the matrix engine must be capable of supplying two operands and
receiving one answer every cycle, which from Chapter 5 was suggested to be approximately ten
nanoseconds. Such a high data request and retirement rate is typical of vector supercomputers,
for which elaborate memory systems have already been designed [89, 77] etc. These typically
exploit vector and matrix structures by implementing heavily interleaved memory systems that
rely on the assumption that a fetched matrix will have logically adjacent matrix elements in
physically adjacent memory locations. Unfortunately, in many embedded applications, such as
the Discrete Fourier Transform with prime factor mapping, this is not the case, as the applied
matrices will be created from selected data within a larger amount of data. Therefore, a system
that can support both the 'standard' access patterns typically used in vector systems and the
'non-standard' access patterns used for prime factor mapping, transposed mappings, etc is re-
quired. This chapter will describe such a system, based on a four-bank bank-switching cached
system, using standard components except for a custom bank-switching chip.
6.2 Applying and Extracting the Outer Product
As the only function performed by the array is the outer product, with other functions handled
by the data controllers, a brief discussion of how the outer product is extracted is useful.
6.2.1 Extracting the Outer Product
When the array is unloaded (the outer product is extracted), each accumulator in the array
moves its data into a latch. The data is then shifted serially along a row or column (depending
on whether storage in row- or column-major is required) until it reaches the array boundary,
where it is unloaded in parallel form and then serially retired to memory. An extract from the
processing array showing the unloading hardware is shown in Figure 6.1.
Note that the transpose of a matrix stored in row-major form is the same matrix stored in
column-major form, and vice-versa. Therefore, if the transpose of a matrix is required¹, the
matrix can simply be extracted in the opposite form from usual.
¹As is noted in Chapter 3, the transpose is often required due to the common use of the matrix relationship
AᵀBᵀ = (BA)ᵀ to achieve the correct ordering.

Figure 6.1: Extraction Hardware
6.3 Memory Overview
Although the memory bandwidth requirement for the matrix engine has been reduced by a factor
of P for a P × P array compared to a MIMD² architecture with a similar number of processors,
the memory bandwidth required to sustain the high speed possible is still very large.
The sort of memory access bandwidths required for the matrix engine are reminiscent of
those of vector supercomputers such as the CRAY 2, Y-MP, Fujitsu VP200 and the NEC SX-2,
in which data accesses are required at rates of the order of eight to sixteen nanoseconds per word
[81, 80, 9]. These provide the data at the required rate by allowing only load and store accesses
between memory and a number of 'vector registers', while the vector computational units operate
only on data from the vector registers. This type of architecture is known as a vector-register
architecture, as distinct from a memory-memory architecture in which all data operands are
loaded from and stored to memory. Additionally, the only major vector architecture that routes
data through a cache is the IBM 3090VF [35].
The memory system that was designed is a combination of vector-register and memory-
memory architectures. The advantages of recycling data through the array indicate that a
large data cache is desirable.
The complete memory system can be considered in one of two ways, either
• a cached memory-memory system, in which the accesses are all to and from a cache which
is supplied from a large main memory, or
• a vector-register system that uses static RAM modules to construct a very large register
bank for which each vector has only a single place to which it can be allocated.
6.3.1 Difference From Vector Memory Systems
Although the array memory system has many similarities to those commonly found in a vector memory system, there are some differences. These can be summarised as:
• Non-uniform strides.
The stride of an access pattern is the separation between the addresses of successive accesses. For example, the pattern
0, 1, 2, 3, ...
has a stride of one, or a unit-stride, whereas the pattern
0, 5, 10, 15, ...
has a stride of five. The access patterns of a vector architecture are greatly simplified by the assumption that all of the two loads and one store have the same stride. Therefore, a large, heavily-interleaved single memory can service all memory pipes, as shown below in Figure 6.2 for an interleaving factor of eight³.
Figure 6.2: Vector Memory Architecture.
With reference to Figure 6.2, if memory pipe 0 accesses memory bank M0, pipe 1 accesses bank M2 and pipe 2 accesses bank M4, then in the next access cycle, a unit stride access will cause pipe 0 to access bank M1, pipe 1 to access bank M3 and pipe 2 to access bank M5. If suitable delays are introduced to pipes 1 and 2, then arbitrary offsets in the access addresses can be included, while non-unit strides require only that the memory access time combined with the interleaving factor is less than the ALU cycle time. Thus, a stride of two in the system in Figure 6.2 requires that the ALU cycle time is greater than or equal to the memory cycle time (every second bank is skipped), whereas a stride of three requires that the ALU cycle time is greater than or equal to half the memory cycle time (all banks are accessed).
However, if the accesses have strides that are not all the same (non-uniform strides), then memory bank conflicts occur. For example, if pipe 0 has a stride of one starting at bank M0, and pipe 1 has a stride of two starting at bank M2, the bank access patterns will be those shown in Figure 6.3.
The circled access of bank 6 shows the conflict that occurs when pipe 1 'catches up' to pipe 0.
Although great care can be taken with numerical computations to ensure that the strides of all accesses are the same, there are other cases, such as the DFT using prime factor mapping, that require the strides to be non-uniform. Therefore, a memory system capable of supporting non-uniform strides is required.





















Figure 6.3: Non-uniform Stride Conflict.
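To make the conflict mechanism concrete, the short sketch below (an illustration only; the eight-bank configuration follows Figure 6.2, but the stream lengths and starting banks are assumed) counts the cycles on which two pipes address the same bank.

```python
# Simple model of bank conflicts in an 8-way interleaved memory.
# Two pipes generate address streams with independent strides; a
# conflict occurs when both address the same bank on the same cycle.

NUM_BANKS = 8

def bank_sequence(start, stride, length):
    """Banks touched by an access stream of the given stride."""
    return [(start + i * stride) % NUM_BANKS for i in range(length)]

def count_conflicts(seq_a, seq_b):
    """Cycles on which both pipes hit the same bank."""
    return sum(1 for a, b in zip(seq_a, seq_b) if a == b)

# Uniform strides: pipe 1 starts two banks ahead, both stride 1 -> no conflicts.
pipe0 = bank_sequence(0, 1, 16)
pipe1 = bank_sequence(2, 1, 16)
print("uniform strides, conflicts:", count_conflicts(pipe0, pipe1))

# Non-uniform strides: pipe 1 (stride 2) periodically 'catches up' to pipe 0.
pipe1 = bank_sequence(2, 2, 16)
print("non-uniform strides, conflicts:", count_conflicts(pipe0, pipe1))
```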
• Predictable Address Sequences.
Although vector memories are typically heavily pipelined [89, 80], they are also subject to the effects of instruction branch operations. Therefore, if the memory system is too deep, long delays may result in the event of a branch or jump instruction. However, as the matrix engine performs a complete matrix operation (multiplication, Gaussian elimination, etc) with one instruction, no branch instructions will occur during the matrix operation, so a relatively large number of addresses can be precalculated before being sent to memory, thus preparing the system in advance to select the correct memory bank, check tags, etc.
6.4 Cache
With the decision to use bank swapping to move data quickly between the output and inputs, another problem presented itself. If the main memory is swapped, the data in the data controller caches would need to be flushed before either the memory bank or the cache can be read or written to. The simple solution to this is to attach the cache to the memory bank rather than the data controller, and move the cache with the memory.
The questions that arose in the design of the cache included:
• How large should the cache be built?
• What should the block size of the cache be?
• What associativity should be used?
• How soon after a miss can the next address be accessed?
• How is a 'write' dealt with?
These questions can have different answers, depending on the application run. However, answers to the questions are suggested below.
6.4.1 Cache Size
The answer to this is the same as for any other cache system - as large as possible while still remaining cost effective. However, a cache that is too small may be more hindrance than help, as the tagging of cache blocks may cause an additional overhead with no added benefit.
6.4.2 Block Size
Increasing the block size attempts to reduce the block fetch penalty by amortising it over a larger number of accesses.
If the miss penalty is P, the refill time per word from memory is R, the block contains B words and the cache hit access time is H, then the time to access every word in a newly fetched block is

Time₁ = P + (B − 1)R + (B − 1)H = P + (B − 1)(R + H)   (6.1)

and the average access time per word is

Av. Time₁ = [P + (B − 1)(R + H)] / B   (6.2)

If the block size is increased such that (B − 1)(R + H) ≫ P, then the average access time per word approaches (H + R). This suggests that a very large block size should be used. However, if not all of the block is used by the processor, the unused accesses from memory represent wasted cycles. If there are W words in the block that are not required before the block is overwritten, then the time to access all the required words is:

Time₂ = P + (B − 1)R + (B − W − 1)H
      = P + (B − 1)(R + H) − W·H   (6.3)

and the average access time per required word is

Av. Time₂ = [P + (B − 1)(R + H) − W·H] / (B − W)   (6.4)

Therefore, if W is close to zero, Equation 6.4 approximates Equation 6.2. However, as W becomes large (W → B), the average access time in Equation 6.4 approaches

Av. Time₂ ≈ [P + (B − 1)R] / (B − W)
which tends toward infinity⁴. Additionally, the larger the block size, the fewer blocks can be kept in a particular sized cache. Therefore, a block size that is too large can potentially reduce the hit rate of the cache if only parts of each block are required.
Obviously, the idea is to use a block size that is as close as possible to the number of words used by the processor. The optimum size of a block, therefore, depends not only on the algorithm being executed, but also on problem specific considerations such as problem size, data configuration (the access pattern required to extract the data), etc. As the address pipeline can be quite deep, it is feasible to shift the internal cache tag by an arbitrary amount before tag comparison. Therefore, the block size can be chosen dynamically from a range of values from (say) 16 to 1024 at initialisation, depending on the hardware included to support a particular block size.
⁴Of course, the zero in the denominator is only present for the case of B = W, ie none of the words in the cache block load are used - clearly a ridiculous case.
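The trade-off described above follows directly from Equations 6.3 and 6.4. The sketch below evaluates the average access time per required word for several block sizes; the timing values P, R and H are assumed for illustration and are not taken from the design.

```python
# Average access time per required word for a cache block of B words,
# of which W are never used (Equation 6.4). P = miss penalty, R = refill
# time per word from main memory, H = cache hit access time (all in ns).

def avg_access_time(B, W, P=80.0, R=20.0, H=10.0):
    assert 0 <= W < B, "at least one word of the block must be used"
    total = P + (B - 1) * (R + H) - W * H     # Equation 6.3
    return total / (B - W)                    # Equation 6.4

for B in (4, 16, 64, 512):
    # W = 0: whole block used; W = B // 2: half the block wasted.
    print(f"B={B:4d}  all used: {avg_access_time(B, 0):7.2f} ns"
          f"   half wasted: {avg_access_time(B, B // 2):7.2f} ns")
```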
6.4.3 Associativity
Cache associativity is the number of positions in a cache in which a block can be placed when it is loaded into the cache. The choices are:
• A single position. Each block from main memory may be present in only one place in the cache. This is known as a Direct Mapped cache. Generally, the block is mapped into the cache block that is the (memory block address) modulo (the number of blocks in the cache). The advantage of this scheme is that only a single tag location needs to be checked to see if a block is present in the cache.
• Any position, known as Fully Associative. A block from main memory may be placed anywhere, so a supposedly 'fair' block allocation scheme would fill all empty blocks in the cache before overwriting an existing cache block. However, as the block may reside in any location in the cache, or not at all, all cache tags must be simultaneously checked for the presence of a block.
• Any of a selected set of positions. Known as Set Associative, this scheme is a mixture of Direct Mapping and Fully Associative, in that the set of locations in which a block may be placed is fixed (direct mapped), but any location within the set may be used to store the block (fully associative).
Generally, set associative caches provide better hit rates than direct mapped caches, without the massive complexity demanded by fully associative caches [18]. However, as the caches for our system will generally be addressing matrices with very regular access patterns, set associative caches lose some of their appeal, as a set associative cache using a least recently used (LRU) replacement policy is no better than a direct mapped policy for addressing a well stored partitioned matrix. In fact, simple simulations indicate that, coupled with the added overhead due to the increased complexity of a set associative cache, a direct mapped cache provides better performance than a set associative cache. These simulations modelled matrix address patterns and maintained cache access data to monitor the frequency of cache hits and misses and their effect on system performance. For this reason, a direct mapped cache is recommended.
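A minimal version of the kind of simulation referred to above can be expressed in a few lines. The following sketch is not the simulator used for this work; it is an illustrative direct mapped model with assumed cache and matrix sizes.

```python
# Direct mapped cache model: hit/miss counting for a block-partitioned
# matrix walk. All sizes are illustrative, not the thesis configuration.

CACHE_BLOCKS = 256     # number of blocks in the cache
BLOCK_WORDS  = 16      # words per cache block

def simulate(addresses):
    tags = [None] * CACHE_BLOCKS      # one tag per cache block
    hits = misses = 0
    for addr in addresses:
        block = addr // BLOCK_WORDS
        index = block % CACHE_BLOCKS  # (memory block address) mod (blocks in cache)
        tag = block // CACHE_BLOCKS
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1
            tags[index] = tag
    return hits, misses

# Walk a 128 x 128 row-major matrix in 32 x 32 partitions (a 'well stored'
# partitioned access pattern).
N, p = 128, 32
pattern = [(bi + i) * N + bj + j
           for bi in range(0, N, p) for bj in range(0, N, p)
           for i in range(p) for j in range(p)]
hits, misses = simulate(pattern)
print(f"hits={hits}, misses={misses}, hit rate={hits / len(pattern):.3f}")
```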
6.4.4 Latency After a Miss
The section on block size (Section 6.4.2) assumed that if a cache miss occurred, then the block in which the missed word belongs was fetched from memory, the missed word was supplied, and the remainder of the block also written to cache memory. Any reads subsequent to the one that caused the cache miss were stalled until the block refill had finished. A major saving can be made here by using 'data streaming' to access data as it arrives during a refill, especially if large block sizes are used.
The idea of 'streaming' is very simple, and most advanced microprocessors use some form of instruction streaming. The concept behind instruction streaming is that it is very likely that instructions subsequent to the instruction that caused the cache miss and refill will also be in the fetched block, and in fact be in adjacent locations to the instruction fetched. Therefore, considerable time can be saved by monitoring the addresses within the block as they are loaded into the cache, and using the data as it is loaded, rather than waiting for the refill to complete before the next instruction is fetched.
As the matrix engine implements a complete matrix operation such as matrix multiplication or Gauss-Jordan elimination with a single instruction, the advantage of instruction streaming is not of much use. However, due to the large amount of data that is moved from main memory into cache memory in the order used by the matrix engine, some form of data streaming would provide a great improvement in performance.
As the matrix address access pattern is well defined, and a well ordered matrix extraction will typically contain accesses to large amounts of data stored in successive locations in memory, the data written into cache memory during a block refill is typically also required for a future access. Therefore, data streaming will allow the data to be read directly into the array simultaneously with the cache refill, thus potentially saving the 'block size minus one' cache accesses that may be required if streaming is not implemented.
6.4.5 Write Policy
The question of which write policy to use ultimately provides little choice. Of the two main policies, write-through and write-back, the latter is the only suitable approach. It only remains to explain why this is the case, and to consider the details.
To see why a write-through policy is not feasible, it is necessary to consider its place in the matrix processor system and also the reason why it works in a conventional system.
On a store, a write-through cache writes the data to both the cache and main memory. Therefore, main memory always has an updated copy of the elements in the cache. Write-through succeeds because the percentage of time spent performing writes is small, and the cache acts in many ways like a large buffer as far as writes are concerned. However, if a very large number of successive writes occur, main memory cannot keep up with the cache, and the processor will 'write-stall' until sufficient writes have been retired to main memory. This is exactly the case with the output data controller for the matrix processor - there are no operations but write (store) operations. Therefore, the main memory will continue to write-stall, and thus stall the entire engine, for matched read-write operations.
A write-back cache suffers in part from a similar problem, but to nowhere near the same extent. However, many out-of-order or scattered operations can be handled in a much better manner using a write-back cache, and the main memory can support the block writes typical of write-back caches much more efficiently (see Section 6.5).
6.4.6 Implementation
The complete system is basically comprised of three identical Data Controller/Memory pairs.

Figure 6.4: Data Controller / Cache Memory Pair
The latches can be implemented using standard 74ASXX or 74VHC1XX components. Octal latches such as Motorola's 74AS373 have maximum propagation times of approximately 5.0 nanoseconds, which leaves another 5.0 nanoseconds for propagation between latches in a 100 MHz system; this is ample for a well designed MCM or high speed system.
Although not strictly 'off-the-shelf', there was still an attempt to use readily available high speed Static RAM modules. Static RAM (SRAM) modules are available from manufacturers such as Integrated Device Technologies (IDT), Electronic Design Inc. (EDI) and Motorola. None of the SRAM modules available in the data sheets [68, 40] come in a latching version with access times of ten nanoseconds or less. Therefore, the cache memory will need to be 2-way interleaved, although it is only a matter of time until the timing constraints will allow a single bank of memory with cycle times operating at 100 MHz.
The suggested SRAM module is the IDT7M4077, manufactured by Integrated Device Technologies, or similar. This is a 256k × 32 memory subsystem, with read and write cycle times of fifteen nanoseconds. As 2-way interleaving is required, the minimum size of a cache will be 512k × 32 bits = 2 Mbytes. For a one megaword cache, two interleaved banks will be required. Therefore, as there are four memory ports, each with approximately four SRAM modules, sixteen memory modules will be needed in total.
To determine the cache tagging requirements, the cache size, address range and cache block size must be known. For convenience, it can be assumed that each data port can potentially access 2³⁰ ≈ 1.07 × 10⁹ locations. The block size is determined at run-time but is in the range 16 to 1024 words, denoted by 'B'. The cache size is implementation dependent, but will contain approximately one million locations, denoted by 'C'. The depth of the tag SRAM is equal to the maximum number of blocks that can be contained in the cache SRAM. However, if the cache SRAMs are interleaved and the tag SRAM is not fast enough to be read for every access, then the tag RAM will need to be duplicated by the cache SRAM interleaving factor, 'I'. The length of a tag word is determined by the total number of addressable locations divided by the number of locations in the cache. Using these criteria, the tag requirements can be calculated, and are presented in Table 6.1.









































































Table 6.1: Tag Requirements
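The calculation behind Table 6.1 can be reproduced mechanically from the criteria above. The sketch below uses the stated address range (2³⁰ locations), a one megaword cache and a 2-way interleave; the helper itself is illustrative.

```python
# Tag RAM sizing from the criteria given in the text:
#  - tag depth  = maximum number of blocks resident in the cache SRAM
#  - tag width  = log2(addressable locations / cache locations)
#  - the tag RAM is duplicated once per SRAM interleave factor ('I')
from math import log2

def tag_requirements(address_locs, cache_locs, block_words, interleave):
    depth = cache_locs // block_words               # blocks resident in cache
    width = int(log2(address_locs // cache_locs))   # bits to identify a block
    copies = interleave                             # one copy per interleaved bank
    return depth, width, copies

ADDRESS_LOCS = 2 ** 30          # ~1.07e9 addressable locations per port
CACHE_LOCS   = 2 ** 20          # ~one million cache locations ('C')
for B in (16, 64, 256, 1024):   # run-time selectable block size ('B')
    depth, width, copies = tag_requirements(ADDRESS_LOCS, CACHE_LOCS, B, interleave=2)
    print(f"B={B:5d}: tag depth={depth:6d} entries x {width} bits, x{copies} copies")
```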
6.5 Main Memory
It is anticipated that the matrix engine will be used on very large problems in the fields of aerodynamics and fluid flow, among others, which involve the solution of very large, dense matrices. Typically, none of the matrix elements will be zero. Practical problems require matrix sizes of the order of hundreds to thousands of elements in a row or column (~200 × 200 → 10,000 × 10,000), resulting in storage requirements of 40 kWords to 100 MWords, which equates to 160 kbytes to 400 Mbytes for single precision or 320 kbytes to 800 Mbytes for double precision. Add to this the bank-swapping nature of the equation solving routines (Gaussian elimination - see Section 3.4), which at least doubles the memory requirements, and the need for very large amounts of main memory becomes apparent.
To keep the cost of the main memory to a reasonable amount, Dynamic Random Access Memories (DRAMs) should be used. DRAMs have the advantage over Static Random Access Memories (SRAMs) that they provide a much denser memory system at a cheaper price. This is paid for by the slower access times of a DRAM system and by the added complexity due to the multiplexed nature of a DRAM chip.
In an attempt to match cache operations to main memory, it was decided to concentrate mainly on block data transfers. While this may not be ideal for embedded systems engaged in high-speed Digital Signal Processing (DSP) or large control systems (Kalman filtering, neural networks), such systems typically will not use a cache/DRAM hierarchy, operating instead entirely out of the static RAM used for the cache. In such a system, the tag checking can simply be turned 'off', and the main memory comprised of dynamic RAMs omitted. Thus, the main memory design is intended only for a numerical system solver type architecture.
The majority of DRAM chips available in the marketplace today are of an asynchronous nature. These require that the system place the upper half of the address on the input address pins, after which an address strobe is asserted. Then the lower half of the address is placed on the address pins and a second address strobe is asserted. The internal structure of a 1M × 4 static column DRAM chip is shown in Figure 6.5 [68]. The static column refers to the fact that, once a particular row of the memory array has been selected using the RAS signal and the address pins A0...A9, any column of the selected row can be accessed by changing only the input address pins; ie no column address strobe (CAS) signal is required.
The data sheets provided by Motorola give the time to read from a 1M × 4 static column DRAM as 60 nanoseconds for the first read and 35 nanoseconds for any successive read that falls within the same DRAM page. Therefore, for a ten nanosecond cycle time, the main memory needs to be four words wide, or use an interleaving factor of four. With an interleaving factor of four, a 32-bit word and four memory banks, the fundamental width of memory will be 4 × 4 × 4 = 64 bytes. If these are composed of 1M × 4 DRAM chips, the minimum memory size will be 64 Mbytes of main memory. Although this may appear to be a large amount, it is small compared to other systems of comparable performance.
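The interleaving factor is simply the ratio of the in-page read time to the required cycle time, rounded up. The following sketch restates the arithmetic using the quoted data sheet figures.

```python
# Required interleaving factor so that a stream of in-page DRAM reads
# sustains the processor cycle time. Figures from the quoted data sheet:
# 60 ns first read, 35 ns per successive in-page read.
from math import ceil

def interleave_factor(in_page_read_ns, cycle_ns):
    return ceil(in_page_read_ns / cycle_ns)

factor = interleave_factor(35.0, 10.0)   # -> 4
banks, word_bytes = 4, 4
print("interleave factor:", factor)
print("fundamental memory width:", factor * word_bytes * banks, "bytes")  # 64
```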
Recalling that the cache SRAM is two-way interleaved, the main memory can be considered to be a 2-way 64-bit wide memory rather than a 4-way 32-bit wide memory. This eases the data switching requirements at the main memory to cache interface, as less data needs to be pushed onto a smaller bus.
6.5.1 Implementation
As there is a very large pipeline for the cache memory system, the row address can be ready at the address inputs of the DRAM when a miss is signalled. Therefore, the row address strobe (RAS) can be asserted as soon as the miss is detected.




































A block diagram of the DRAM configuration is shown in Figure 6.6. The input address is partially applied to multiplexer Mux0 and then loaded into the 4-bit synchronous counters, which are used as block counters. In this way, an arbitrary sized block starting address can be created from any address within the block. The starting address is passed to the second multiplexer, Mux1, which separates the address into row and column addresses, the higher address bits first. Note that the row address (upper address bits) can be transmitted through multiplexer Mux1 while the block starting address is being calculated and loaded into the block counters. The row address is then latched into latches 0 to 3, and applied to all the DRAM arrays simultaneously. The RAS0 to RAS3 lines can then all be asserted.
Once the row address has been applied to the DRAM arrays, the column address (lower address bits) can be applied to latches 0 to 3 via multiplexer Mux1, and then the chip select signals (CS0-3) asserted⁵. For all successive column addresses (block addresses), the latches and chip select signals are used in turn.
If the memory operation is a memory read, then the data output from the DRAMs is passed to latches 6 & 7 via multiplexers Mux2 & Mux3, and then to the SRAMs in the cache. As the cache SRAMs are 2-way interleaved (Section 6.4), DRAM banks 0 & 2 feed SRAM bank 0, and DRAM banks 1 & 3 feed SRAM bank 1. While data is being applied to the output latches, it is also applied to the 74AS280 parity checking chips, which will flag a parity error in the event that there has been a data storage error.
If the memory operation is a memory write, the data to be written is applied to the DRAM arrays via latches 4, 5, 6 & 7. Note that, although the write hold time for the MCM54402A is fifteen nanoseconds [68] and the chip select pin (CS) could be used to control which DRAM bank data is written to, the physical connection between the DRAM chips would disrupt memory reads. Therefore, separate latches are used. While data is held in the input latches, it is also applied to the 74AS280 parity checking chips to generate a parity bit for each byte.
The initial timing of a burst read is shown in Figure 6.7. A burst write is similar, except
that the data flows in the opposite direction.




























Figure 6.6: Block Diagram of Four-Way Interleaved DRAM Subsystem























Figure 6.7: Initial Timing of Burst Read
6.5.2 Synchronous DRAMs
An alternative to using the asynchronous DRAM chips described above is to use the relatively new synchronous DRAM chips that are just becoming available from manufacturers such as Samsung. These DRAMs, denoted SDRAM for Synchronous Dynamic Random Access Memory, are similar to a core DRAM chip with some extra registers added for timing and burst control.

Figure 6.8: Synchronous DRAM Internal Block Diagram
The Samsung KM48SH2000-6 is a synchronous DRAM configured as 2M × 8 bits that can burst from the input address with lengths of 2, 4, or 8, or a full page (512). It has a CAS latency of two cycles, and a maximum clock frequency of 100 MHz. After the initial latency, the SDRAM chip cycles at the clock rate, ie provides a new byte every 10 ns. Once a burst has been completed, the outputs automatically enter a high impedance state. Additionally, memory chip precharging can commence two clock cycles before the burst is complete. Therefore, with the faster versions of the SDRAM that require two cycles of precharge, a new cycle can commence immediately the previous one has finished.
Using the 2M × 8 chips running at 100 MHz, no interleaving is required to support ten nanosecond accesses. Therefore, the minimum memory increment per bank is 2M × 32 = 8 Mbytes for a 32-bit word. Then for four memory banks, 4 × 8 Mbytes = 32 Mbytes is required, which is similar to the size required for asynchronous DRAMs. The difference between the options is the reduction in added complexity, as the synchronous chips contain all the necessary counters, and there is no need for multiplexing. There will also be fewer memory chips using synchronous DRAMs, as they have a 16 Mbit density instead of a 4 Mbit density, although this





























The main memory of the system should be large enough to hold a very large matrix without requiring any access to (slower) Input/Output devices. The main memory can either be composed of one segment of thin DRAM chips, such as 4M × 1 bit, or several segments of fewer, wider DRAM chips, as shown in Figure 6.9.

Figure 6.9: Memory Bank Configuration using a) Thin & b) Wide DRAMs
The structure using thin DRAMs (Figure 6.9a) is much simpler to implement, as no selection between bank segments is required. However, the configuration using several segments composed of wide DRAMs has several advantages, despite the more complex control required. The first, and most obvious, is the possibility of interleaving the segments or using a wide bus to supply data at a higher bandwidth. As alluded to in Chapter 5, an increase in bandwidth can be accompanied by the square of the increase in computational rate, depending on implementation. Thus, two-way interleaving can quadruple the computational rate, 4-way interleaving will increase the rate by a factor of 16, etc.
The second advantage comes in the expansion of system memory. As memory control hardware generally requires equal sized banks, the smaller segments allow the main memory to be expanded in smaller increments.
The third advantage is linked to the cache system, in particular the fact that the cache will be very large and of a similar size to a realistic segment size (eg 1 MWord). This can be used to advantage when cache write-backs are required. Given that the cache will be direct-mapped, and assuming a cache size equal to a segment size, then if a block in the cache is displaced and the data in the block needs to be written back to main memory, the incoming data (the replacement block) must be coming from a different segment. Therefore, the read from one segment can be initiated while the write back to another segment is still completing, as in Figure 6.10.
Typically, the access time from the last write for a 70 nsec static column 1M × 4 dynamic RAM is approximately 65 nsec [68]. If the new block is in a different DRAM page, which is highly likely with the large block sizes available, then a RAS precharge period of approximately 40 nsec will be required. Additional delays in the change from read to write will account for approximately 15 nsec, for a total delay of 120 nsec.
Figure 6.10: Initiating a Read Simultaneously with Completing a Write
6.6 Bus Exchangers
The requirement in the memory system that large banks of memory be swapped very rapidly presented a difficulty, in that standard off-the-shelf CMOS and AS-TTL chips were not fast enough. As we have access to very fast and reasonably priced 0.7 µm and 0.5 µm CMOS processes via TIMA-CMP in France, it was decided to design our own chips, which we designated 'Bus Exchangers', shortened to 'Bus-X'.
The Bus Exchangers must be able to connect any of the three matrix engine ports or the external port to any of the four memory subsystems. As there are two input ports and one output port to the matrix engine, the Bus Exchangers need to be bi-directional, although the directionality of the switches need only be determined at a memory swap. Additionally, they must be able to latch the transmitted data, as well as behaving as a transparent switch. A modification of the latch structure in Figure 6.11a) can be used to implement the latch/transparent features, which we shall refer to as the 'driver' cell (shown in Figure 6.11b)).
Figure 6.11: a) Latch Cell b) Driver Cell
The drivers need to push data through four p-type and four n-type pass transistors, configured in parallel for each inverter. The transmission gates are paired to provide the two levels of transmission control while using minimal area. The VLSI layout of the driver cell is shown in Figure 6.12.
Figure 6.12: VLSI Layout of the Driver/Transmission Cell
If it is assumed that a read is data applied to 'In' and a write is data applied to 'Out', where 'In' comes from an external source and 'Out' is from the switches, then there are three possible modes of operation:
• Latched Read
Data is applied to 'In', whereupon it is latched as well as being passed through to the output, 'Out'. A-Cont is pulsed high, B-Cont is pulsed low and C-Cont is always high.
• Transparent Read
Similar to a Latched Read, except that a change in the input is reflected by a similar change in the output, 'Out', delayed by the propagation delay of the device. A-Cont is always high, B-Cont is low and C-Cont is high.
• Write
Data passes directly from 'Out' to 'In'. A-Cont is high, B-Cont is high and C-Cont is low.
6.6.1 Control
The bus exchangers must produce some internal control signals from the input control signals. The inputs to the BusX units for control are:
• r/w
The read or write control signal. This signal controls the direction of data flow, either from one of four inputs to the single output (multiplexer) or from one input to one of four outputs (demultiplexer).
• a0 & a1
The two select signals that determine which of the four ports is connected.
• t/l
The control signal indicating whether the mux/demux is operating in the transparent mode or the latch mode.
• Ck
The input clock. The clock must be inhibited in the event that a stall occurs when the BusX is operating in latched mode. Otherwise, incorrect data may be latched into the BusX.
The internal control signals can be broken into:
• Direction (r/w)
The read/write signal determines the direction of data flow through the Bus Exchanger. This signal can be a buffered version of the input read/write signal.
• Select (Sel0, Sel1, Sel2, Sel3)
These four signals are the decoded 'select one of four' signals. They are decoded from the input select signals and their complements (a0, ā0, a1, ā1), using four two-input gates, one for each select line.
• Driver control signals (a-cont, b-cont, c-cont)
The signals a-cont, b-cont and c-cont in Figure 6.11 must be decoded for each of the drivers. As the connection to the bus must only be made once, the transmission gate controlled by a-cont must disconnect the driver from the bus unless the correct select signal (SelX) is present; a-cont is the only control signal that depends on the driver select signals. The control signals are:





where 'sel' is the appropriately decoded select signal, t & l are the decoded transparent and latch mode signals, and ∧ & ∨ are rising and falling pulses respectively.
Note that the control signals above are for data being applied to the drivers from off-chip for a read. The opposite driver will require a similar set of control signals, except with the 'r' and 'w' signals exchanged.
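Since the detailed control equations are not reproduced here, the behavioural sketch below only illustrates the two pieces that are described in the text: a 2-to-4 decode of the select inputs, and the steady-state A/B/C-Cont levels for the three driver modes (the pulsed edges of the latched read are not modelled).

```python
# Behavioural sketch of the Bus Exchanger control decode. decode_select
# is a standard 2-to-4 decode of the select inputs (a0, a1) into Sel0..3;
# DRIVER_MODES lists the A/B/C-Cont levels quoted in the text for the
# three operating modes (pulse timing is not modelled).

def decode_select(a0: int, a1: int):
    """Return (Sel0, Sel1, Sel2, Sel3) with exactly one line asserted."""
    sel = [0, 0, 0, 0]
    sel[(a1 << 1) | a0] = 1
    return tuple(sel)

DRIVER_MODES = {
    "latched read":     ("pulsed high", "pulsed low", "high"),
    "transparent read": ("high",        "low",        "high"),
    "write":            ("high",        "high",       "low"),
}

assert decode_select(0, 0) == (1, 0, 0, 0)
assert decode_select(1, 1) == (0, 0, 0, 1)
for mode, (a, b, c) in DRIVER_MODES.items():
    print(f"{mode:16s}: A-Cont={a}, B-Cont={b}, C-Cont={c}")
```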
6.6.2 Read/Write Path
For data passing from one of four inputs to a single output, a 4-to-1 multiplexer is needed, while for data passing from one input to one of four outputs a 1-to-4 demultiplexer is needed. As the data paths are bidirectional, data must be able to flow both ways through the mux/demux combination, and so transmission gates were used, rather than combinatorial logic. Such a 4-to-1 bidirectional mux/demux is shown in Figure 6.13.

Figure 6.13: Four-to-One Multiplexer/Demultiplexer
The VLSI layout for this configuration is simple, and is shown in Figure 6.14. Due to the regularity of the design, the VLSI section shown is in fact the transmission mux/demux for two bits, one moving from left to right and the other moving from right to left. The red vertical lines representing polysilicon are the mux/demux control signals. There are two pairs of complementary control signals running adjacently, denoted {a0, ā0} & {a1, ā1}. An output signal from a driver will only appear on the metal2 line (purple) if the appropriate transmission gates are conducting. Otherwise, the signal will be blocked.
Figure 6.14: VLSI Section of Four-to-One Mux/DeMux
6.6.3 The Complete Bus Exchanger
Eight of the multiplexer/driver units were combined with a single control unit to produce a byte wide Bus Exchanger unit. The full Bus Exchanger unit is shown in Figure 6.15.

Figure 6.15: Bus Exchanger VLSI Layout
The Bus Exchanger was simulated using HSpice for several configurations. As each bit path is exactly the same as all the others, simulations of one will produce the same results as simulations of any of the others. Therefore, although it was checked that each bit worked correctly, the results of only one bit path are shown, except for the cases where the results are the same for each bit and can be shown simultaneously.
The first test case was when the Bus Exchanger was operating in 'latch' and 'read' mode, and the 'zeroth' input was always selected (a0 & a1 both 0 volts). The resulting simulation results are shown in Figure 6.16. The plots show the outputs a-out to h-out following the inputs a0-in to h0-in approximately six nanoseconds after the rising edge of the clock pulse.













































Figure 6.16: Latched Read With Constant Select Signals
Next, one of the select signals (a1) was varied part way through the simulation, so that the input was changed from the 'zeroth' input to the 'second' input (eg. a0-in to a2-in). The results are shown in Figure 6.17. These show the output a-out lagging the input signal a0-in by approximately six nanoseconds after the clock for the first half of the simulation period, followed by the output becoming the input signal a2-in for the remainder of the simulation.

Figure 6.17: Latched Read With Varying Select Signals
With the data flowing in the opposite direction (nominally a data write), the unselected outputs are blocked from the actual output, and the data flows from the 'output' side (eg. a-out) to the 'input' side (eg. a0-in). This case was simulated for data applied to pins a-out to h-out, with the select signal a1 varied from 0 volts to 5 volts half way through the simulation. The resulting simulation plot is shown in Figure 6.18. The pins that are not used as outputs (a1-in, a3-in, b1-in, etc) are all at approximately two and a half volts, although of course they can be pulled to whatever logic level is required by another Bus Exchanger driving them. The signals that are driven (a0-in, a2-in, b0-in, etc.) follow the driving signal (a-out, b-out, etc) only with the correct select signal.

Figure 6.18: Latched Write With Varying Select Signals
If the Bus Exchanger is run in 'transparent' mode (instead of 'latched' mode), then the simulations shown in Figures 6.19 and 6.20 are produced. These are the equivalent simulations to those in Figures 6.17 and 6.18 respectively, except for the change in latch/transparent mode. The outputs follow the applied inputs after approximately three

































































to four nanoseconds in either direction, and independently of the clock signal.

































































Figure 6.19: Transparent Read With Varying Select Signals































Figure 6.20: Transparent Write With Varying Select Signals
6.7 Loading Double-length Words
If double precision (64-bit) arithmetic is required, either four cells in the processing array can be merged or the data can be passed through in halves. In either case, the
bandwidth in terms of the number of operands for the processing array will be one quarter of
that when single precision (32-bit) data is being processed. As each operand is fetched with a
single address, and a double precision operand is twice the size of a single precision operand,
the address generator will operate at one quarter of its single precision rate, and one half of the
bandwidth in terms of bytes will be required for double precision.
For a memory system that is two-way interleaved when fetching 32-bit operands, the memory system need no longer be interleaved when fetching 64-bit operands. This will reduce the complexity of the memory system.

7 Performance Estimates





The following performance estimates are based on the memory structures described in Chapter 6. Some performance figures estimate the operational speed of a 'cut-down' system, typically one without the main dynamic memory, and hence no cache tag checking. Unless otherwise stated, it is assumed that the array used has a variable bandwidth, ie, the processing elements are all the same speed. The array is assumed to be of size p × p, with C MWords of cache memory and M MWords of main memory. The bandwidth of the cache to main memory is B MWords per second, with τ ns per load, and a latency for a cache refill of L cycles (= Lτ).
7.1 Matrix Multiplication
For a matrix multiplication running directly from the cache SRAM, the most fundamental calculation is based on a partitioned matrix product with no delays due to memory. If the two matrices, A & B, that are to be multiplied together are of dimensions R × S and S × T respectively, then the final product matrix will be of size R × T, divided into ⌈R/p⌉ × ⌈T/p⌉ partitions, each of size p × p. The input matrices will be divided into ⌈R/p⌉ & ⌈T/p⌉ partitions respectively, of size p × S & S × p. The time to perform the product is simply calculated by

T_simple mult = ⌈R/p⌉ × ⌈T/p⌉ × p × S × τ   (7.1)

The floating point performance in terms of MFlops (Million Floating Point Operations per Second) is simply the number of floating point operations done divided by the time taken as given in Equation 7.1:

Perf_simple mult ≈ 2RST / T_simple mult   (7.2)
The performance plot for this simple estimate is shown in Figure 7.1, below. In this plot, the array size is assumed to be fixed at 1024 processing elements (32 × 32, p = 32), the load/store time per word (τ) is ten nanoseconds, and the matrices multiplied together are assumed to be square (R = S = T).
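Equations 7.1 and 7.2 are straightforward to evaluate; the sketch below reproduces the simple estimate under the same assumptions as the plot (p = 32, τ = 10 ns, square matrices).

```python
# Simple cache-resident matrix multiply estimate (Equations 7.1 and 7.2).
from math import ceil

def t_simple_mult(R, S, T, p=32, tau_ns=10.0):
    """Time in seconds: ceil(R/p) * ceil(T/p) * p * S * tau."""
    return ceil(R / p) * ceil(T / p) * p * S * tau_ns * 1e-9

def mflops(R, S, T, p=32, tau_ns=10.0):
    flops = 2 * R * S * T                 # multiply-accumulate pairs
    return flops / t_simple_mult(R, S, T, p, tau_ns) / 1e6

for n in (32, 100, 256, 512, 1024):
    print(f"n={n:5d}: {mflops(n, n, n):8.1f} MFLOPs")
```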
For comparison purposes, significant benefits are obtained if either the array size is increased or the memory load/store time is reduced. Figure 7.2a) shows the performance if the array dimension is doubled (p = 64), and Figure 7.2b) the performance if the load/store time is halved (τ = 5 ns). Figure 7.3 shows the performance if both changes are made together.

Figure 7.3: Performance for Array Dimension = 64 and Load/Store Time = 5 ns (MFlops)
The significance of Figure 7.3 is that it shows the squared nature of the performance of the
array for a constant bandwidth. This is an example of the concepts of address bandwidth that
lead to Equation 5.4.
If the data must be loaded from main memory, then cache refill time must be included in the equations. There are two cases to be considered for a cache refill, the cases with and without data streaming. As described in Chapter 6, data streaming is when data loaded from main memory is available directly to the processor simultaneously with the data being written into the cache. Without data streaming, the complete data block must first be written into cache memory and then the desired data read from cache memory. If it is assumed that the cache is of size C MWords and a cache block is of size D words, then for a direct mapped cache in which the data is stored in the same order that it is required (ie in partitioned form), a few assumptions can be made that will provide a simple estimate for system performance.
It is easiest to break the time required for a calculation into two categories, those when there is an integral number of array partitions in a matrix (R mod p = 0) and those when a full array boundary is not filled (R mod p ≠ 0). Consider reading data into the array for the case of an integral number of partitions (R mod p = 0). Then the time taken to load all the data into the
cache will be the latency plus the fill period, ie

T_load ≈ (R²/D)L + R²τ

The data that is loaded into the cache will be used immediately, so it will need to be retrieved from cache memory. This will take another R²τ cycles, after which the diagonal blocks of the solution (Cᵢᵢ) will have been calculated. This will leave (R/p)(R/p − 1) partitions to be calculated, each requiring approximately Rp cache loads. Therefore, the time required for computation after the data has been loaded into cache will be

T_comp ≈ R²τ + R²τ(R/p − 1)

which totals

T_Main Mult ≈ (R²/D)(L + 2Dτ) + R²τ(R/p − 1)   (7.3)
The second case to consider is for R mod p ≠ 0. Then there will be ⌊R/p⌋ filled partitions in each dimension, which will take the same load and diagonal computation time as before, ie

T_load ≈ ⌊R/p⌋²(p²/D)(L + 2Dτ) + ⌊R/p⌋²p²τ(⌊R/p⌋ − 1)

Added to this is the time it takes to compute the unfilled edges of the matrix. The edge width will be the total width minus the width of all the filled partitions, ie R − p⌊R/p⌋. The size of the data contained in this portion of the matrix will be O(R(R − p⌊R/p⌋)), which must be loaded and multiplied by all partitions (filled and unfilled). As there will be edges at two sides, the edge time will be approximately

T_edge ≈ (2R(R − p⌊R/p⌋)/D)(L + Dτ) + 2R⌈R/p⌉pτ

These times are summarised in Equation 7.4:

T_Main Mult ≈ (R²/D)(L + 2Dτ) + R²τ(R/p − 1)   if R mod p = 0
T_Main Mult ≈ T_load + T_edge                   otherwise   (7.4)
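Because Equations 7.3 and 7.4 are reconstructed here from a damaged original, the sketch below should be read only as an illustration of their structure for the R mod p = 0 case, comparing the unstreamed block cost (L + 2Dτ) with the streamed cost (L + Dτ) introduced later; all parameter values are assumed.

```python
# Structure of the main-memory estimates (R mod p == 0 case only):
# an unstreamed refill costs L + 2*D*tau per block (write to cache,
# then read back); streaming removes one D*tau factor.
def t_main_mult(R, p=32, D=64, tau=10e-9, L=80e-9, streamed=False):
    blocks = R * R / D
    per_block = L + (1 if streamed else 2) * D * tau
    return blocks * per_block + R * R * tau * (R / p - 1)

def mflops(R, **kw):
    return 2 * R**3 / t_main_mult(R, **kw) / 1e6

for R in (128, 512, 1024):
    print(f"R={R:5d}: unstreamed {mflops(R):8.1f} MFLOPs,"
          f" streamed {mflops(R, streamed=True):8.1f} MFLOPs")
```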
TMoin,n",Ir^



















For a fixed array size of 1024 processing elements arranged as 32 × 32 with a constant bandwidth of 100 MWords/sec, increasing the block size from 4 to 512 results in an improvement in the overall speed of the system as the block size increases, especially for matrices of small order. However, the improvement in performance as the block size increases above sixteen is small. Figures 7.4 a) through d) show the performance of the processor as the block size is increased from 4 to 16 to 64 to 512. The other very noticeable feature is the performance advantage obtained when the matrix can be divided into an integral number of blocks of the same width as the processing array. The latency for a cache miss is assumed to be 80 nanoseconds.

Figure 7.4: Performance Estimates for Matrix Multiplication (Block Sizes a) 4 b) 16 c) 64 d) 512)

If streaming is added to the functionality of the processor, then data can be used immediately it is available from main memory, and a separate write-to-cache phase is not required. In effect, this removes a factor of D × τ from the memory load part of the time equations, giving




















































































T_Stream Mult ≈ (R²/D)(L + Dτ) + R²τ(R/p − 1)   if R mod p = 0
T_Stream Mult ≈ ⌊R/p⌋²(p²/D)(L + Dτ) + ⌊R/p⌋²p²τ(⌊R/p⌋ − 1) + (2R(R − p⌊R/p⌋)/D)L + 2R⌈R/p⌉pτ   otherwise   (7.5)
The plots for the performance with streaming included are shown in Figures 7.5 a) through d). These show the improvement in performance with increasing block size, with significant improvements up to a block size of 64, and with more improvement as the block size increases further. This performance improvement relies heavily on proper placement of data in memory to achieve its full promise. However, if the data is located poorly in memory, the streamed version will do no worse than the unstreamed version.

Figure 7.5: Performance Estimates for Matrix Multiplication With Streaming
7.2 Gauss Jordan Elimination and Inversion
As mentioned previously in Section 3.4, one of the advantages of Gauss-Jordan elimination is that a matrix inversion uses exactly the same routine, with the column range extended to an augmented matrix including the identity matrix, I. The important trick to achieving maximum performance for Gauss-Jordan elimination is to very carefully partition the algorithm to fit both the array size and the cache memory size. If the inverse is sought, the identity matrix need not be explicitly included, as the columns of the identity matrix can be included as the inverse grows and the input matrix reduces. The performance estimates here assume that the data controller can stream data, and that the data is stored in block rows in main memory.
7.2.1 Gaussian Elimination for Matrices Smaller than the Cache Size
If the input matrix (the one that is to be reduced using Gauss-Jordan elimination) fits entirely into the cache, the only load from main memory (cache refill) will be the initial load. The procedure given in Section 3.4.1 can be implemented directly, with no blocking due to cache size.
The first step is to determine the inverse of the pivot block A₁₁. The block must be loaded into cache from main memory and then operated on to produce the block inverse. Using the block inversion algorithm from Section 3.4.2, five iterations are required to produce an inverse, with each iteration requiring three passes through the processing array, for a total of fifteen passes through the array. Added to this will be a factor due to the final scalar division of the initial estimate being completed entirely after all the data in the row has been fetched, the penultimate scalar division commencing with one load still to occur, etc. The time to perform
































the first block inversion will also include the time to load the block from main memory, and can be characterised by

T_pivot invert ≈ ⌈p²/D⌉(L + Dτ) + 15p²τ + p × T_scalar divide   (7.8)

where T_scalar divide is the time it takes to perform a scalar inversion operation, and is in terms of cycles. Once the first iteration of Gauss-Jordan elimination has completed, all the data will be resident in the cache.







If the matrix to be solved is smaller than the size of the processing array, then the pivot inversion stage is all that is required. In general, this will not be the case, and the pivot row will need to be normalised and all other rows updated. As a column oriented scheme will make memory accesses simpler (Section 3.4.3), the row normalisation and the matrix update can be performed together. Bearing in mind that there are (n − 1) updates, the k(th) of which is of length (n − k), the average length of an update will be n/2, and so the time taken for the complete Gauss-Jordan elimination will be approximately

T_Gauss-Jordan ≈ (⌈R/p⌉ + ⌈(R−p)/p⌉) × Rpτ + ⌈R/p⌉(15p²τ + p × T_scalar divide) + ⌈p²/D⌉L   (7.10)

The number of required operations is approximately R³ − 1.5R² + 0.5R, so the performance of the array in MegaFLOPs when performing Gauss-Jordan elimination is

Perf_GJ ≈ (R³ − 1.5R² + 0.5R) / T_Gauss-Jordan   (7.11)
Figure 7.6 shows the performance estimates for solving a matrix of size less than 1025 rows for various cache block sizes.

Figure 7.6: Performance for Gauss-Jordan Elimination on Matrices that Fit Into Cache

If the inverse of the matrix is required, then the same procedure is used, except that the updated matrix does not reduce in size. Equation 7.10 can be used, except that the factor ⌈(R−p)/p⌉ becomes ⌈R/p⌉(⌈R/p⌉ − 1), and a factor of Rτ is added to the case of R < 2p. An estimate of the time taken to perform a matrix inversion is then




T_Invert ≈ ⌈p²/D⌉(L + Dτ) + 15p²τ + p × T_scalar divide   if R < p
T_Invert ≈ (⌈R/p⌉ + ⌈(R+p)/p⌉) × Rpτ + (15p²τ + p × T_scalar divide) + Rτ   if p ≤ R < 2p
T_Invert ≈ (⌈R/p⌉ + ⌈R/p⌉(⌈R/p⌉ − 1)) × Rpτ + ⌈R/p⌉(15p²τ + p × T_scalar divide)   otherwise   (7.12)
Note that this is actually slightly conservative, as a full update is not required for the added rows. The given matrix is augmented a column at a time by a column of the identity matrix, which for the i(th) update is the column vector (0, 0, ..., 0, 1, 0, ..., 0)ᵀ, with the '1' in the i(th) position. Therefore, rather than requiring two multiply-accumulate pairs for an update (one to load the array) and one multiply-accumulate pair for the normalise, the result of the normalise is the inverse of the pivot, which is already calculated, and the update is only a single multiply-accumulate pair. However, as the order of this saving is small, the difference on the overall performance estimate is small.
The number of required operations to invert a matrix using Gauss-Jordan elimination is 2R³ − 4R² + 5R + 1, and so the performance of the array when inverting a matrix is

Perf_Invert ≈ (2R³ − 4R² + 5R + 1) / T_Invert   (7.13)

Figure 7.7 shows the performance estimates for inverting a matrix of size less than 1025 rows for various cache block sizes.

Figure 7.7: Performance for Gauss-Jordan Inversion on Matrices that Fit Into Cache
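The operation counts quoted in this section translate directly into the MegaFLOP ratings. The helper below restates Equations 7.11 and 7.13; the solve time used in the example is an assumed figure for illustration, chosen to be consistent with the three GigaFLOP estimates given later.

```python
# Operation counts used for the Gauss-Jordan performance ratings
# (numerators of Equations 7.11 and 7.13).
def ops_eliminate(R):
    return R**3 - 1.5 * R**2 + 0.5 * R

def ops_invert(R):
    return 2 * R**3 - 4 * R**2 + 5 * R + 1

def mflops(op_count, time_s):
    return op_count / time_s / 1e6

R = 1000
t_solve = 0.31   # assumed solve time in seconds, for illustration only
print(f"eliminate {R}x{R}: {ops_eliminate(R):.3e} ops, "
      f"{mflops(ops_eliminate(R), t_solve):.0f} MFLOPs at t={t_solve}s")
print(f"invert    {R}x{R}: {ops_invert(R):.3e} ops")
```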
7.2.2 Gaussian Elimination for Matrices Larger than the Cache Size
If the matrix to be solved is too large to fit into the cache, then the matrix can be partitioned into equal sized blocks that do fit into the cache. Ideally, the blocks will be the same size as the cache. Thus, for a cache size of one Megaword:
• a matrix of size 1500 × 1500 would be partitioned into four blocks each of size 750 × 750
• a matrix of size 2000 × 2000 would be partitioned into four blocks each of size 1000 × 1000
• a matrix of size 3000 × 3000 would be partitioned into nine blocks each of size 1000 × 1000
• a matrix of size 10000 × 10000 would be partitioned into one hundred blocks each of size 1000 × 1000
The procedure is very simple, and is just an extension of the block Gauss-Jordan algorithm given in Section 3.4.1. The 'pivot' block is inverted using the algorithm described above in Section 7.2.1. The block thus inverted is multiplied by all the other blocks in the same row, with the performance estimated by Equation 7.7. All other blocks are updated as for a standard Gauss-Jordan elimination method, and then the procedure repeats. To estimate the performance,

























































let f = √C, so the block dimension will be the largest number that is less than f while still being an integral factor of the full matrix dimension. Let the block dimension be denoted by ρ, and let there be ξ² blocks of size ρ × ρ in the matrix (ie R = ξρ). Then

T_Large GJ ≈ ξ × T_Invert + (2ξ − 1)(ξ − 1)ξ × T_Mult(ρ)   (7.14)

The number of operations is again R³ − 1.5R² + 0.5R, and so the performance of the system when solving a matrix that is larger than the cache size is

Perf_Large GJ ≈ (R³ − 1.5R² + 0.5R) / T_Large GJ   (7.15)
             = (R³ − 1.5R² + 0.5R) / (ξ T_Invert + (2ξ − 1)(ξ − 1)ξ T_Mult(ρ))   (7.16)
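The choice of block dimension can be computed directly from the rule given above; the sketch below reproduces the partitionings listed in Section 7.2.2 for a one Megaword cache.

```python
# Partitioning a matrix that does not fit into a cache of C words:
# block dimension rho = largest integral factor of R with rho <= f = sqrt(C),
# giving xi = R / rho blocks per dimension (R = xi * rho).
from math import isqrt

def partition(R, cache_words=2**20):
    f = isqrt(cache_words)                  # f = sqrt(C), here 1024
    rho = max(d for d in range(1, f + 1) if R % d == 0)
    return rho, R // rho                    # (block dimension, xi)

for R in (1500, 2000, 3000, 10000):
    rho, xi = partition(R)
    print(f"R={R:6d}: {xi * xi:4d} blocks of {rho} x {rho}")
```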
Equation 7.16 is plotted for a range of block sizes in Figure 7.8.

Figure 7.8: Performance for Gauss-Jordan Inversion on Matrices that Do Not Fit Into Cache

An interesting point to notice is that the performance actually peaks for a matrix a few times larger than the cache size, before declining to the previous Gauss-Jordan asymptote. The reason for this peak is linked to the performance difference between matrix multiplication and matrix inversion. As the method of adding two matrices is to load one matrix into the array (a multiply-accumulate pair with accumulator reset) and then multiply-accumulate the other matrix without resetting, Gauss-Jordan elimination can be expected to run at approximately half the speed of an equivalent matrix multiplication. However, the normalisation phase does not require an addition, so it operates at multiplication speed. Therefore, if the normalisation phase is of a similar order to the update phase, the processing array will be running at the speed of a matrix multiplication for a significant period. As the update phase is of an order that is the square of the order of the normalisation phase, the peak speed will occur when there are two to four blocks in a dimension. This is shown clearly in the plots.
7.2.3 Conclusion
The plots given in this section show that the matrix processor configured with modest dimensions (~32) and address generation times (~10 ns) is capable of speeds in excess of three GigaFLOPs for Gauss-Jordan elimination and inversion. As the plots in Figure 7.8 show, the size of a problem that can be solved at this speed is largely limited not by the processing array size or rate, but by the memory size.
An interesting point of reference is the Linpack1000 and Linpack10000 benchmarks [23]. Versions of these benchmarks can be summarised as the number of floating point operations required divided by the time taken to perform a 1000 × 1000 and 10,000 × 10,000 matrix reduction using any method. The important point about the required operations is that any extra operations are not included in the operations count. As this is the case for the plots given in this chapter, the Linpack1000 and Linpack10000 benchmarks will have ratings in excess of three GigaFLOPs. It is interesting to note that it would take less than one second to solve a set of 1000 linear equations, while a set of 10,000 linear equations would be solved in approximately six minutes.
7.3 Discrete Fourier Transform
The procedure outlined in Section 3.5 leads to some very simple estimates for the performance of the DFT on the processing array and memory system. It is anticipated that an













































embedded system will not have a multi-level memory, although this may be necessary if very large size DFTs are required.
The performance of a two dimensional DFT is simply the number of floating point operations divided by the time for two multiply-accumulate operations and one unload operation. The inclusion of the unload is due to the fact that the second matrix product is dependent on the result of the first. However, if the size of the matrices is larger than the size of the processing array, the unloading of each partition can be 'hidden' within successive partition products, and the calculation time becomes the time for the two matrix products.
If a higher n-dimensional DFT is performed, then the unload operation is again hidden by successive matrix products, as all premultiplications can be completed before the postmultiplications are started. Therefore if the size of the first dimension to be calculated is N₁, then the time to complete the first dimensional computation will be

2N₁ × T_matrix product + T_unload

Here, the unload time is due to the very final unload before the output bank can be swapped back. If the dual ported RAMs in the feedback path are available, then the temporary output data can be returned to the input without any bank swapping, and no unload delay is required. The time to complete each dimension is then:

T_per dimension = 2Nᵢ × T_matrix product   (7.17)
As the length of a transform is the product of the n dimension lengths, which are mutually prime, the time taken for only selected transform lengths will be estimated, and the performance in equivalent MegaFLOPs provided. The equivalent MegaFLOP rating is calculated by dividing the number of necessary operations as required for a standard FFT algorithm by the time taken to perform the calculation. The text by Press et al. [73] was used to determine an approximate operations count, which was found to be approximately 6.5N log N for a transform of length N. The processing array is assumed to be square, of dimension p × p. The times are shown in terms of the cycle time τ_c, which is the time it takes to load a single data element into the array from memory, or from the cache in the case of a multi-level memory system. Note that the Fourier coefficients will not need to be reloaded into cache memory in a multi-level memory system as the input data must be, so the load time will always be dependent on the input data and not on the Fourier coefficients.
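The equivalent-MegaFLOP convention is easy to restate in code. The sketch below combines the 6.5N log N operations count with the two dimensional time estimate of Equation 7.19 (which appears below); the 10 ns cycle time is an assumption carried over from the earlier estimates.

```python
# Equivalent MegaFLOP rating for a length-N DFT: the operations count of
# a standard FFT (~6.5 * N * log2(N), after Press et al. [73]) divided by
# the time actually taken by the matrix formulation.
from math import log2

def equivalent_mflops(N, time_s):
    return 6.5 * N * log2(N) / time_s / 1e6

# Example: a 31 x 29 = 899 point transform on a 32 x 32 array, timed with
# the two dimensional estimate of Equation 7.19 and an assumed 10 ns cycle.
N1, N2, p, tau_c = 31, 29, 32, 10e-9
t_2d = (2 * N1 + N2) * p * tau_c
print(f"{N1 * N2}-point DFT: {equivalent_mflops(N1 * N2, t_2d):.0f} equivalent MFLOPs")
```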
7.3.1 Discrete Fourier Transform Without Multi-level Memory System
If only a single level memory system is used, the procedure and estimates are very simple. It can be assumed that the four banks of memory are still used, but there is no cache tagging to be checked. The Fourier coefficients reside in one of the memory banks, while the input data is deposited into another of the banks. A third bank is needed for temporary storage during processing and for storing the output data, while the input data for the next iteration of the DFT can be loaded into the fourth bank in preparation.
Two Dimensional Transform
If the length of the transform is less than p², then the DFT can be completed in one iteration. All that is required is a matrix product, an unload, and then another matrix product. Due to the dependence of the second product on the first, the time will be approximately 3p²τ_c if no consideration is made for matrix products on matrices that are smaller than the processing array. If such consideration is made, the time reduces to

T_2D ≈ (2 × N₁ + N₂) × p × τ_c   (7.19)

where N₁ and N₂ are the two dimensions of the transform, and N₁ ≥ N₂.
A selection of two dimensional transform lengths and their computational times in cycles is given in Table 7.1 for dimension lengths smaller than the processing array size, which is assumed to be 32 × 32 processing elements. The performance in MegaFLOPs is also given, assuming a cycle time (τ_c) of ten nanoseconds.































































































Table 7.1: 2 Dimensional DFT Performance
The data is plotted in Figure 7.9. This plot shows the performance improvement available by
using two similar sized matrices instead of a larger and a smaller matrix that produce a similar
transform length. The far right hand side shows the peak performance for the array, which is
slightly less than 2000 MFLOPs.

Figure 7.9: Performance of Two Dimension DFT - FLOPs vs DFT length

If the transform length is greater than the product of the maximum dimensions of the two
matrices allowed by the processing array size (without partitioning), two options are available.
Either

1. Use a higher dimensional DFT. This is dealt with in Section 7.3.1, or

2. Partition the two input matrices to fit within the processing array.

If the partitioning option is chosen, then all four dimensions of the address generator are
required for automatic matrix partitioning. The time to calculate a partitioned two dimensional
DFT is the time to perform the two partitioned multiplies plus the time to unload the last partition
calculated in the first multiplication. Table 7.2 shows the performance details of selected two
dimensional partitioned DFTs, with each input matrix being less than four times the size of the
processing array.

Figure 7.10: Performance of Partitioned Two Dimension DFT - FLOPs vs DFT length
Some performance estimates for selected larger-sized matrices are given in Table 7.3, and
are plotted in Figure 7.11.
Figure 7.11: Performance of Partitioned Two Dimension DFT - FLOPs vs DFT length

Three and Four Dimensional Transform

The alternative to using a partitioned two dimensional transform to perform larger DFTs is to use
a three or four dimensional transform. However, this choice must be made with care, as adding a
new dimension of small matrix dimension will result in 2(n − 1) inefficient product groups for an
n-dimensional mapping. To illustrate, consider increasing a two dimensional mapping to three
dimensions, and let the original two sizes be 31 and 29 on a 32 × 32 processing array. Then using
a three dimensional mapping with dimension sizes 31, 29 & 4 will require 31 (29 × 4) and 29
(31 × 4) matrix products, which would be very inefficient on the 32 × 32 processing array (see
the sketch below). However, as the size of the dimensions increases, the efficiency increases, and
a higher dimensional mapping will have a higher performance than a partitioned lower
dimensioned mapping.
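The inefficiency of small product groups can be seen from a simple utilisation count. The sketch
below (Python) assumes each N_i × N_j product group occupies an N_i × N_j corner of the
p × p array, so utilisation is simply the occupied fraction; the function name is illustrative.

    def product_utilisation(n_rows, n_cols, p=32):
        # Fraction of the p x p array kept busy by an
        # n_rows x n_cols product group (padding assumed).
        return (n_rows * n_cols) / float(p * p)

    print(product_utilisation(31, 29))  # ~0.88: the 2-D mapping is efficient
    print(product_utilisation(29, 4))   # ~0.11: the added small dimension is not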
The time to perform a multidimensional DFT is similar to the time to perform a two
dimensional mapping, except that more calculations can be achieved before the array must
stall for an unload operation. In fact, if proper use is made of the dual ported RAMs, the
unload stall can be avoided altogether. Table 7.4 shows the performance figures for selected
three dimensional DFTs, which are also plotted in Figure 7.12.

The time taken for a four dimensional transform can be calculated using the estimate for a
three dimensional transform. The performance estimates are shown in Table 7.5, and graphically
in Figure 7.13.

Figure 7.12: Performance of Three Dimensional DFT - FLOPs vs DFT length

Figure 7.13: Performance of Four Dimensional DFT - FLOPs vs DFT length
7.4 The Kalman Filter
The Kalman filter described in Section 3.6 can be directly implemented on the matrix processing
array with the bank-switched memory architecture that all previous performance estimates have
used. The important performance measurement for the Kalman filter is the time it takes for a
complete update, as this is in effect the repetition rate.

Two example systems can be estimated, one running entirely from static RAM (effectively
running out of cache RAM with no tag checking), and the other using a multilevel memory
subsystem. Additionally, the size of the problem for each memory configuration can be varied
such that the problem size is smaller than the processing array (no partitioning required) or
the problem size is larger than the processing array (partitioning is required). Each case is
considered separately. The processing array is assumed to be of size p × p, with an access cycle
assumed to be of length τ.
7.4.1 Small Problem Size Running From SRAM
This is the case of ρ ≤ p. As the processing array has a constant bandwidth, the processor
must stall after applying a wavefront until the time for a full width product has been completed;
ie if ρ − 1 operands are applied as a wavefront to the p × p processing array, the processing
array must stall for one cycle, if ρ − 2 operands are applied, the processor must stall for two
cycles, etc. This only applies in one dimension of the applied matrices. If the matrix product
AB is performed and A is of dimension r × s and B is of dimension s × t, then the application
of both A & B will take ps cycles, and the unloading of the result will take max(pr, pt) cycles.
Therefore, if all matrices are assumed to be square and of dimension ρ, non-square matrices
use an equivalent time if the larger of the two dimensions is made the dimension of the square
matrix.

Proceeding in the same order as is provided in Table 3.1, the time to compute the matrix
K(k) can be estimated as follows:
    T_1 ≈ ρp × τ    (7.20)

    T_U ≈ 2 × ρp × τ    (7.21)

    T_Delay ≈ p² × τ    (7.22)
    T_inv ≈ (15p² + (p/2) T_scalar divide) × τ    (7.23)

    T_end ≈ ρp × τ    (7.24)

where the time for inversion is from Equation 7.12. Summing these to determine the time to
calculate K(k) produces Equation 7.25.

    T_K ≈ T_1 + T_U + 2T_Delay + T_inv + T_end
        ≈ (4ρp + 17p² + (p/2) T_scalar divide) × τ    (7.25)
Similarly, the determination of P(k+1|k) requires five matrix products and one accumulate,
which takes the same time as a multiply, all of which can take place without stalls. The estimate
for the P(k+1|k) update is then given by Equation 7.26.

    T_P ≈ 5 × ρp × τ    (7.26)

The final part is the state estimate vector update x̂(k+1|k), which requires four matrix-vector
products. As the processing array has a constant bandwidth, the matrix-vector products
effectively consume the same time as a matrix-matrix product. Therefore, the x̂(k+1|k) update
consumes time according to Equation 7.27.

    T_x ≈ 4 × ρp × τ    (7.27)

Summing all components provides an estimate of the time the Kalman filter takes in this
simple implementation. The resultant time is given in Equation 7.28.
    T_Kalman1 = T_K + T_P + T_x
              ≈ (13ρp + 17p² + (p/2) T_scalar divide) × τ    (7.28)
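The total in Equation 7.28 can be evaluated directly. The sketch below (Python) uses the
component estimates as reconstructed above; the coefficients, and the scalar divide cost of 16
cycles, are assumptions rather than figures taken from the original tables.

    def kalman_time_small(rho, p, tau, t_scalar_divide=16):
        # Equations 7.20-7.27 as reconstructed above (rho <= p).
        t_K = (4 * rho * p + 17 * p * p + p * t_scalar_divide / 2.0) * tau
        t_P = 5 * rho * p * tau
        t_x = 4 * rho * p * tau
        return t_K + t_P + t_x      # Equation 7.28

    # Example: 32 x 32 array, 10 ns cycle, state vector of length 24.
    print(kalman_time_small(24, 32, 10e-9))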
This result is plotted in Figures 7.14a) to d) for array sizes of 32 and 64 and cycle
times of ten and twenty nanoseconds. The plots show the time on the vertical scale and the
number of elements in the state vector x̂(k|k−1) on the horizontal axis. The plots show the time
increasing linearly with vector size, as the matrix processor multiplies matrices in a time that
is approximately linear in the state vector length.

Figure 7.14: Simple Kalman Filter for State Vector Smaller Than Array Dimension; a), b): 10ns cycle time, array sizes 32 and 64; c), d): 20ns cycle time, array sizes 32 and 64
7.4.2 Large Problem Size Running From SRAM

Now the problem must be partitioned into matrices that fit the processor array size. The time
to perform the multiplication of two matrices that are larger than the size of the processing
array is equivalent to the sum of the time to calculate all the full matrix partitions plus the time
to calculate all the partial matrix partitions. The times required for the computation can again
be broken into the component calculations of K(k), P(k+1|k) & x̂(k+1|k). The time estimation
for the computation of K(k) proceeds as follows.
To calculate the temporary matrix L, each of the partitions L_ij must be calculated in turn.
The time for these calculations has been determined in Equation 7.1, and is

    T_mult ≈ ⌈R/p⌉ × ⌈S/p⌉ × p × S × τ    (7.29)

The calculation of the temporary matrix U also requires an addition at the start of each
partition calculation, which will add approximately p² cycles to each partition calculation time.
The time to calculate U can then be calculated from Equation 7.30.

    T_U ≈ (2 × pS + p²) × τ
        ≈ (2S + p) × p × τ    (7.30)
The inversion procedure is again similar to that determined earlier, except simplified for the
less complex memory system. The time to invert each p × p block is again

    T_inv ≈ (15p² + (p/2) T_scalar divide) × τ    (7.31)

and so the total time for the inversion of each of the pivot blocks is simply the time for one
inversion multiplied by the number of partitions, as shown in Equation 7.32.

    T_all piv ≈ (15p² + (p/2) T_scalar divide) × τ × ⌈R/p⌉    (7.32)

In each column, there will be a maximum of ⌈R/p⌉ partitions, so there will be 2⌈R/p⌉ − 1 updates
per column. The total number of updates for the complete inversion process is then the average
number per iteration multiplied by the number of iterations. This is expressed in Equation 7.33.

    T_all inv ≈ T_all piv + T_update
             ≈ (15p² + (p/2) T_scalar divide) × τ × ⌈R/p⌉ + (2⌈R/p⌉ − 1)(⌈R/p⌉ − 1) × p² × τ    (7.33)

After the inversion, the matrix K(k) can be determined with one extra multiplication, which
takes time according to the previously determined Equation 7.29. Note that no delay slots
are required, as the feedback of one partition can now be hidden by the calculation of another
partition.
The calculation of P(k+1|k) is simply three partitioned multiplies, as for Equation 7.29, and
then two in-place multiplies and an in-place accumulate. The times for the in-place multiplications
are each approximately equal to the time to perform a multiply-accumulate operation, as the
only difference is the ordering of the multiplications of the partitions. The time required for
the in-place accumulate is simply the time required for an 'array sized' multiplication, as it is,
in effect, a single array-sized operation. The total time for the
update P(k+1|k) can be approximated by the time to perform five partitioned multiplies plus
one array-sized multiplication. This is summarised in Equation 7.34.

    T_P update ≈ 3T_part. mult. + 2T_part. mult-acc + T_part. acc
               ≈ (5 × ⌈R/p⌉ × ⌈S/p⌉ × S + p) × p × τ    (7.34)
When calculating the updated state vector x̂(k+1|k), we recall that the matrix-vector
product operation must use the full bandwidth of the array. Therefore, the matrix-vector product
will consume the same time as a matrix-matrix product with the vector replaced by a matrix
of the same width as the processing array. The time to perform each of the three matrix-vector
products involved in the calculation of x̂(k+1|k) can therefore be approximated by the time in
Equation 7.35, while the vector accumulate takes the time shown in Equation 7.36.

    T_vect mult ≈ ⌈R/p⌉ × p² × τ    (7.35)

    T_vect acc ≈ p × p × τ    (7.36)
The total time for a large Kalman filter with a simple memory system is then the sum of all
the parts, as shown in Equation 7.37, for the case R = S = T = Q.

    T_large Kalman = T_K + T_P + T_x    (7.37)

where T_K, T_P and T_x are formed by substituting the partitioned component times of
Equations 7.29 to 7.36.
Equation 7.37 was evaluated for a range of vector sizes, implemented on processing arrays of
size 32 and 64 processing elements per side and with cycle times of ten and twenty nanoseconds.
The resultant times are plotted in Figures 7.15a) to d).
Figure 7.15: Simple Kalman Filter for State Vector Larger Than Array Dimension; a), b): 10ns cycle time, array sizes 32 and 64; c), d): 20ns cycle time, array sizes 32 and 64

7.4.3 Kalman Filter With a Multi-level Memory Sub-system

If a multilevel memory subsystem is used, then the time estimates for the various parts of the
Kalman filter can be obtained directly from the equations derived in Sections 7.1 & 7.2. The
ordering of operations in Tables 3.1 & 3.2 still applies, as do the assumptions above concerning the
number of equivalent cycles used in the estimates of Sections 7.4.1 & 7.4.2. As the equations
in Sections 7.1 & 7.2 assume data is initially stored in main memory, no cache reuse is
assumed between successive operations. Therefore, the performance figures estimated here can
be regarded as lower bounds; where cache reuse does occur (or the cache is
large enough to hold more data), the performance will approach the case of the system running
directly from SRAM.
As before, the calculation of K(k) will require two matrix products, one in-place accumulate
and one matrix inversion. The matrix products will take the time determined by Equation 7.6 for
the full size of the matrices, while the time required per partition for the in-place accumulate uses
the same equation except that the matrix size is the same as the processing array (= matrix
partition size). The time for the matrix inversion can be obtained directly from Equation 7.12.
The time taken to calculate K(k) can be estimated to be as shown in Equation 7.38.

    T_ML K ≈ 2T_R×S mult + ⌈R/p⌉² × T_p×p mult + T_p×p invert    (7.38)
Similarly, the calculation of P(k+1|k) requires five multiplications and one in-place accumulate.
Therefore, the time required can be approximately determined using Equation 7.39.

    T_ML P ≈ 5T_R×S mult + ⌈R/p⌉² × T_p×p mult    (7.39)
To calculate the updated state vector x̂(k+1|k), the computation time taken by a matrix-vector
product needs to be estimated. As the matrix in the matrix-vector product will be the
dominant factor for loads, we can estimate the load & calculation time, assuming streaming,
to be as shown in Equation 7.40.

    T_mat./vect. prod ≈ (R²/B) × T_latency + ⌈R/p⌉ × pR × τ    (7.40)

Similarly, the time to accumulate a vector with the contents of the processing array is simply
the time required to load a partition of the input vector, as shown in Equation 7.41.

    T_vect acc ≈ ⌈R/p⌉ × p² × τ    (7.41)

Therefore, as the updating of the state vector requires three matrix-vector multiplies and one
vector accumulate, the time required to calculate x̂(k+1|k) is as shown in Equation 7.42.

    T_ML x ≈ 3T_mat./vect. prod + T_vect acc    (7.42)

The time to complete a cycle of the Kalman filter is then the sum of all the updates, as
shown in Equation 7.43.

    T_ML cycle = T_ML K + T_ML P + T_ML x    (7.43)
Equation 7.43 was evaluated for a range of state vector sizes, using block sizes varying
between 4 and 512 on a 32 × 32 processing array with a ten nanosecond cycle time and 100
nanosecond main memory latency. The results are shown in Figures 7.16a) to d).

As the block size increases, the time to compute the Kalman filter can be seen to decrease.
However, there is minimal significant difference between block sizes of 64 and 512. It is stressed
again that these estimates are lower bounds; the practical performance should be much better,
as some cache reuse will occur between successive operations.

Figure 7.16: Lower Bound on Kalman Filter With Multilevel Memory; a) Block Size = 64, b) Block Size = 512
7.5 Conclusion

The performance estimates shown in this chapter provide a significant insight into the potential
performance that is available using systolic techniques coupled with clever addressing and
interface units. Very few computer systems can provide sustained performance in the GigaFLOP
range.

8 Future Projects
8.1 A Multi-Processor Teraflop Engine

Although many in engineering see this field as a continual push forward toward an ultimate goal
of infinitely fast processors along the most direct path, some observers liken the process more to
a corkscrew than an arrow. We have seen processor and memory limits leapfrog forward, with
a new topology suggested at each improvement. However, history shows that some ideas keep
resurfacing on a regular basis whenever the relationship of technologies is at the correct point.
The architecture suggested here is just such a case, nothing new, but a familiar architecture
revisited.

The discussion that follows is not an attempt to define a complete implementation of a full
multiprocessor system. Indeed, the design work required for such a task would require several
man-years. It is meant as an indication of the approximate performance that would be available
given such a system, and also some suggested enhancements that could be used to extract the
most from the base system.
8.1.1 The Hypercube

The interconnection of nodes within a multiprocessor is the subject of much debate at present,
with many different topologies being suggested as the 'ideal'. Current commercial topologies
include the mesh, fat-tree, toroidal mesh and hypercube from Intel, Thinking Machines, Cray
and nCube, to name a few [29].

Not long ago, the hypercube, or n-cube, was touted as the ultimate interconnect, with
commercial machines from Intel, Thinking Machines and nCube and experimental machines being
developed at Caltech, CMU and other establishments. The hypercube is a 'rich' structure that
contains many others within it, and as such can fully utilise its own structure or be reconfigured
to simpler topologies such as mesh, ring or tree structures. As such, the hypercube matches or
includes many configurations that naturally occur in mathematical, scientific and engineering
problems. Indeed, Edelman of the University of California, Berkeley, stated recently [25], "The
hypercube with full concurrent communication to every node's nearest neighbors is the only
mathematically elegant communications network that has ever been devised."

One of the niceties of hypercubes is that they grow by replication. Given two identical
m-cubes (each with 2^m nodes), an (m + 1)-cube is formed by linking their corresponding
nodes in a one-to-one fashion. This leads to the node numbering concept that, for an m-cube
labelled with the numbers 0 to 2^m − 1, the nodes are connected in such a way that the binary
representations of connected nodes differ in exactly one bit. An example labelling for a 3-cube
is given in Figure 8.1. Alternative labellings allow a ring of arbitrary length l, where 4 ≤ l ≤ 2^m,
to be embedded in the cube.
Figure 8.1: Node Labelling for a 3-Cube
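The one-bit labelling property is easy to express in code. The sketch below (Python) generates
the neighbours of a node by flipping single bits, and embeds a full-length ring using the binary
reflected Gray code; both are standard constructions consistent with the labelling described
above.

    def neighbours(node, m):
        # Connected labels differ from `node` in exactly one bit.
        return [node ^ (1 << bit) for bit in range(m)]

    def gray_ring(m):
        # Binary reflected Gray code: consecutive labels differ in one
        # bit, so the sequence is a ring embedded in the m-cube.
        return [i ^ (i >> 1) for i in range(1 << m)]

    print(neighbours(0b101, 3))   # [4, 7, 1]
    print(gray_ring(3))           # [0, 1, 3, 2, 6, 7, 5, 4]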
Why Not Use It?

The move away from the hypercube architecture is generally due to the difficulty in scaling the
hypercube above a dimension of about ten (ie 2^10 = 1024 nodes). With a hypercube of size
2^n, each node has n connections, one to each of its nearest n neighbours. Thus, the number of
interconnections grows rapidly with machine size, posing a dilemma: if the interconnects are all
bit-serial copper links, the speed is often inadequate. On the other hand, if the interconnects
are parallel, byte wide or larger, then a large number of semiconductor pins are required to
drive them, and the system becomes unwieldy and difficult to make reliable. As Michael Meirer
of Thinking Machines said [93], "Unless you have all the hardware drivers and communication
channels on chip, you cannot implement them. You end up with too much hardware; the
reliability and sheer size become unmanageable."

The hypercube was viable in the 1980's because the computational units were simple enough
(eg Thinking Machines CM-2) or slow enough (eg Intel iPSC) that a bit-serial or lower-bandwidth
communication strategy could be used. As the number of nodes and the node performance
increased, so did the number of interconnections and the communication requirements.
New topologies such as the mesh and fat tree were introduced (actually reintroduced), as local
communication with these topologies is machine-size independent. Therefore, the only way to
reintroduce the hypercube is to make the interconnections sufficiently simple that the gains in
topology greatly outweigh the connection complexity.
8.1.2 Fibre-Optical Interconnection

Current multiprocessors using metal interconnects increase the node-to-node bandwidth by
transmitting data with widths of bytes (8 bits), words (32 bits) or long words (64 bits). Therefore,
the interconnection is physically large, and the capacitance that must be driven and discharged,
proportional to the length of the line and the number of lines, will consume large amounts of
power.

Fibre optics, on the other hand, typically benefit from:

• much lower attenuation for a similar length of line
• a propagation speed independent of signal length
• lower signal and clock skew
• higher immunity to crosstalk, electromagnetic interference and ground loops
• flexibility in 3-dimensional space (not constrained largely to two or three perpendicular directions)

Fibre-optic interconnections used in communication systems achieve speeds between 1 and
1.5 Gigabits per second on a standard serial connection, and 2 Gigabits per second is available.
Therefore, a bidirectional internode connection rate of 320 Mbytes per second is quite feasible
using four strands of optical fibre [79]. As this design is for a moderate number of nodes (16
to 1024), limited by other factors such as power dissipation, problem size, etc., each node will
have at most ten (= log₂ 1024) connections to other nodes. As each connection contains four
optical strands, the total number of strands leaving a node will be at most 40. It is interesting
to compare that a mesh connected system with four neighbours to each node (North, South,
East, West) using standard copper interconnects would likely have 32 or 64 copper tracks or
strands between nodes, a similar amount to an optically interconnected hypercube.
The interconnection bandwidth and internode latency determine the time required to transmit
data from one node to another. However, broadcasting the same data to all nodes in a
hypercube is not a simple matter, and is discussed in Appendix C. Ultimately, the time to
broadcast L words over a hypercube with 2^m nodes can be minimised to that in Equation 8.1

    T_broadcast ≈ m(β + τ) + ⌈(L − m)/m⌉ × τ    (8.1)

where β is the node-to-node communication latency and τ is the time to transmit one word to a
nearest neighbour node. For an optically connected hypercube, both m and β are implementation
dependent, while L is problem size dependent and τ is technology dependent.
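A sketch of this bound is given below (Python). The latency and per-word times are
placeholders rather than measured figures, and the constant terms follow the reconstruction of
Equation 8.1 above rather than a verified original.

    import math

    def broadcast_time(L, m, beta, tau):
        # m*(beta + tau) for the first word to cross the critical path,
        # then the remaining words arriving up to m per step (Appendix C).
        return m * (beta + tau) + math.ceil(max(L - m, 0) / m) * tau

    # Example: one 32 x 32 partition (1024 words) on a 64-node (m = 6) cube.
    print(broadcast_time(1024, 6, beta=1e-6, tau=25e-9))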
8.1.3 Utilising the Hypercube - the Proposed Model

In the proposed architecture, the memory is distributed among the nodes and processors, ie
there is no shared memory, even within the node. However, due to the logical labelling of a
hypercube's nodes, the memory system can be viewed as physically distributed but globally
and logically shared. A custom VLSI design implementing a directory search engine utilising
an algorithm similar to the Stanford DASH architecture [50] is suggested, as this allows the
scalability of a message-passing machine to be combined with the model of a shared-memory
architecture for ease of programming.

The suggested node configuration is based on a hybrid of the Intel Paragon XP/S
and a higher performance vector supercomputer. The Paragon XP/S contains up to five i860
RISC processors, allocating four processors to the computational task of problem solving and
the fifth dedicated to the task of managing the message passing and node connections [93, 94].
A high end multiprocessor vector supercomputer typically uses a relatively small number of
very powerful vector processors (typically one to sixteen processors, each capable of between 500
MegaFLOPs and several GigaFLOPs). The node suggested here consists
of matrix engines with attached scalar units (typically three) and the associated communication
hardware (another scalar processor plus custom VLSI directory/coherency manager and node
link hardware) at each node. The nodes are then replicated into a hypercube, to become an
MPP of very high performance nodes.

The advantage of this approach is that the complete system can achieve high performance
with a relatively modest number of nodes. Despite claims by Burkhardt [93], of Kendall Square
Research, and others, that their massively parallel processor machines are only limited by
economics, the latency that must exist for long communication paths as the system grows will result
in detrimental communication overheads for problems that require a large amount of data movement
between nodes. Therefore, by keeping the number of nodes, and hence the critical path
between nodes, relatively small, the communication overheads will be much smaller for similar
performance.

As mentioned above, each node will consist of approximately three matrix engines, each
with an attached scalar processor, for computation. A minimum of nine banks of memory will be
required for the matrix processors, so increasing the number of banks to sixteen will allow ample
extra banks for swapping of data between computation processors and between computation and
communication processors. The interface between the memory and the processors can be either
an extension of the memory structure used for a single matrix array, as shown in Figure 8.2a),
or an X-connect structure, as shown in Figure 8.2b).
Figure 8.2: a) Extended Simple Memory Structure b) X-Connect Memory Structure
8.1.4 Algorithms
Multiplication on the Hypercube
Several authors have investigated the efficiency and performance of matrix multiplication on a
hypercube architecture [29, 28, 36, 42, 74, 78]. The act of splitting the problem across several
processors necessitates the partitioning of the problem, which will generally be rectangular. In
fact, it has been found that, for a standard scalar processor architecture at the nodes, square
partitions perform best, and various 'roll and rotate' algorithms have been developed to exploit
the special structure of the hypercube [29, 28, 74, 86].

Due to the structure of the matrix engines at each node, which perform substantially better
for long 'strips' of products, a rectangular structure would be preferable. As such, the matrix
is partitioned to allow large 'inner products of outer products'. Consider the matrix product
C = AB, where A is a P × Q matrix, B is a Q × R matrix, and so C is a P × R matrix.
Now partition the matrices into 'processing array' sized strips, ie A is partitioned into ψ p × Q
partitions and B is partitioned into φ Q × p partitions for processing on arrays of size p × p.
    [ A_1 ]                           [ C_{1,1}  ...  C_{1,φ} ]
    [  .  ]  [ B_1  B_2  ...  B_φ ] = [    .             .    ]
    [ A_ψ ]                           [ C_{ψ,1}  ...  C_{ψ,φ} ]    (8.2)
This naturally suggests a ring structure with the partitions of one of the matrices rotating around
the nodes.

Now distribute the partitions equally among the nodes. Ideally, there should be more partitions
than processing arrays. If this is not the case, it may be more efficient to use fewer nodes,
and run another problem on the unused nodes. Assume that there is at least one more partition
at each node than there are processing arrays, ie for a system with three processing arrays per
node, there are four or more partitions at each node. Therefore, one of the partitions can be
transmitted while the others are being operated on. This is shown in Figure 8.3.
Figure 8.3: Processing/transmission Diagram for 1st Partitioning
The choice of matrix that is rotated is arbitrary, but careful selection will yield a better result
than a poor choice, due to processor load balancing. If one of the matrices is partitioned such
that the number of processing array width partitions is equal to or just smaller than a multiple
of three (the number of processing arrays at a node), then make this the stable partition. If all
the partitions in the stable matrix are multiplied by a single partition in the rotation matrix, the
nodes can then 'roll' the other matrix (the rotating matrix) by one partition, and repeat the
multiplication step. Repeat the procedure above until all partitions in one matrix have been
multiplied by all partitions in the other matrix. This is shown graphically in Figure 8.4 for the
multiplication of a matrix with ten partitions (A) with a matrix with twelve partitions (B) on a
multiprocessor with two nodes.
Figure 8.4: a) First Multiplication Phase b) Second Multiplication Phase After First 'Roll'

As there are six partitions of the B matrix per node, two partitions of B are multiplied
by each partition of A for each 'roll'. Table 8.1 shows the cycle at which
each multiplication is completed, with 'a' indicating the first multiplication in a cycle and 'b'
indicating the second, and also the processing array in which the multiplication takes place (in
'[]').
Table 8.1: Parallel Multiplication Allocation Table

The bottleneck in the process can be either the computation or the data transmission, depending
on factors such as the internode latency, the cache performance and the number of bands
at each node. To estimate the performance of the system, the computation and transmission
times must be compared. The computation time will be approximately the number of bands of
the rotating matrix at a node, multiplied by the number of stationary bands at a node and by
the time it takes to multiply a single band, divided by the number of node processing arrays.
This has been summarised for a system with ξ nodes and three p × p processing arrays per node
performing the matrix product C = AB, where A has dimensions ψp × Q and B has dimensions
Q × φp, in Equation 8.3. It is assumed that there are more matrix band partitions than there
are processing arrays, and that matrix B has closer to, but less than or equal to, a multiple of
three partitions per node than matrix A.
    T_parallel multiply ≈ max(T_transmit, T_band multiply × ⌈φ/ξ⌉ / 3) × ψ    (8.3)
The transmission time is simply the inverse of the internode bandwidth multiplied by the size
of a band partition, added to the internode latency β. The size of the band partition
is the length of the partition, which is Q, multiplied by the width of the partition, which is p.
Equation 8.4 summarises this.

    T_transmit = Qp / bandwidth + β    (8.4)
The computation time is derived from the memory load time, and is the cycle time plus any
memory latencies. A full matrix band partition will only fit completely into a cache memory if
the 'Q' dimension is less than the cache size divided by the band partition width. For a one
MWord cache and a 32 × 32 processing array, the maximum length band partition that will fit
into the cache is

    Q_max = cache size / p = 1048576 / 32 = 32768

In general, the matrices that will be multiplied on a multiprocessor will be larger than this, so
cache reuse will be at a minimum. Therefore, cache data streaming is almost essential, as is
careful placement of the data in memory. The computation time for a band partition can be
approximated by Equation 8.5, where 'B' is the cache block size and 'τ' is the load/store cycle
time.

    T_band multiply ≈ Q × p × (τ + T_latency / B)    (8.5)
Equation 8.3 can be expressed in more detail, as shown in Equation 8.6.

    T_parallel multiply ≈ max(Qp/bandwidth + β, Qp × (τ + T_latency/B) × ⌈φ/ξ⌉ / 3) × ψ    (8.6)
This equation was evaluated for a range of selected matrix sizes on various sized multiprocessor
configurations. Because there are three degrees of freedom in the sizes of matrices that
can be evaluated, namely the P, Q & R values, the range of matrix sizes that can be simulated
becomes very large. To simplify the simulation process, both the A and B matrices were chosen
to be square, and also to be the same size. However, the performance should be similar in the
event that this is not the case.

The performance in FLOPs (Floating Point Operations Per Second) is shown in Figures 8.5a)
through c) for internode latencies of 100 nanoseconds, 1000 nanoseconds and 10000 nanoseconds
respectively. They show that, once the matrix becomes large enough to be effectively
handled by the multiprocessor system (minimal idle processors), the processing arrays achieve a
sustained speed that no longer increases with matrix size. Note that the estimate assumes that
a band will always be fetched from main memory, so that it underestimates the performance for
small matrices.

Figure 8.5: Performance Estimates for Multiprocessor Multiplication; a) 100ns, b) 1000ns, c) 10000ns Internode Delay
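The model of Equations 8.3 to 8.6 is simple enough to evaluate directly. The sketch below
(Python) does so; the bandwidth (expressed in words per second), block size, latencies and cycle
time are assumed values chosen to be of the same order as those quoted in the text.

    import math

    def parallel_multiply_time(P, Q, R, nodes, p=32, arrays=3,
                               bw=80e6, beta=1e-6, tau=10e-9,
                               t_latency=100e-9, block=4):
        psi = math.ceil(P / p)              # rotating p x Q bands of A
        phi = math.ceil(R / p)              # stationary Q x p bands of B
        t_transmit = Q * p / bw + beta                   # Equation 8.4
        t_band = Q * p * (tau + t_latency / block)       # Equation 8.5
        per_roll = t_band * math.ceil(phi / nodes) / arrays
        return max(t_transmit, per_roll) * psi           # Equations 8.3/8.6

    def sustained_flops(n, nodes):
        # Square n x n matrices need ~2*n**3 floating point operations.
        return 2 * n ** 3 / parallel_multiply_time(n, n, n, nodes)

    print(sustained_flops(10000, 64))   # of the order of hundreds of GigaFLOPs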
Gauss-Jordan Elimination
The procedure for Ga,uss-Jorrlau elimination on a multiprocessor is the same as for a single pro-
cessor, except that the update phase is processed using all the processing ttodes. To implement
the algorithm effìciently, the input matrix A rnust bc partitioned and distributed evenly arnong
the nodes. 'Ihe m.ost obvious partitioning is to divide thc natrix into a,s îìany c.olumns as there
al'e nodes. However, as (ìa,ussian decomposition zelos a column at a, time, the node containing
the first column is only used or.rce, the nocle containing the second column is only used twice, etc..
which is a very inefficient use of the computational power available due to processor utilisation
misrnatch.
Instead. distribute the ro*'s arrìotìg the nodes, an<1 'then oprrra,te on the columns within the
rows stoled al a node. (iauss-Jorrlan eljrnination for the nultiprocessor becorues the loutine
slrorvn in Figure 8.6, u'hich provicles good processor utilisation rnatching. The P x R matrix is
divided into'u" px R. row part,itions. which each contain'Q'p x p column partitions.
If the l'ows are distributed arìtorìg the nodes in gloups the sa,me sizc as the nutnber of
processiug a,rrays at a node. the nunrbe'r of requiled nodes will increa,se ottly when the efficiency of
doing so justifies the inclusion. Therefore) orì a, multiprocessor with { nodes and thlee processing
arraysateachnode,rowpartitions I.2.3,34+t,3€+2,3€+3,...areplaceda'tnode0,partitions
4,5,6,3{ + ¿,34 + 5,3€ + 6,... a,t norle 1. etc. The node allocation function for the 'dlÀ r'ow
pa,rtition of a nra,trix will be
alloc( i ) rnod{ Ili1n¿ (8.7 )
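A direct rendering of Equation 8.7 (Python; the 1-based partition index follows the text):

    def alloc(i, xi, arrays_per_node=3):
        # Node for the i'th row partition (Equation 8.7).
        return ((i - 1) // arrays_per_node) % xi

    # Partitions 1,2,3 -> node 0; 4,5,6 -> node 1; wrapping after 3*xi rows.
    print([alloc(i, xi=4) for i in range(1, 14)])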
Performance Estimation for Gauss-Jordan Elimination
A simple estimate of the time to complete a Gauss-Jordan elimination on a hypercube can be
made. Like the estimates for matrix multiplication, the example performances are provided only
as a guide for selected problem and machine sizes.
for k = 1 to Q
    Â_kk = (A_kk)^(-1)
    for j = k+1 to Q
        Â_kj = Â_kk A_kj
    broadcast pivot row
    for each processing array
        for each row i in a partition
            for j = k+1 to Q
                Â_ij = A_ij − A_ik Â_kj

Figure 8.6: Distributed Row-wise Gauss-Jordan Elimination
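For reference, the routine of Figure 8.6 collapses to ordinary Gauss-Jordan elimination when
the blocks are 1 × 1 and the broadcast is a no-op. The sketch below (Python with NumPy) is
that serial rendering, solving A x = b without pivoting; it is a functional model, not the
distributed implementation.

    import numpy as np

    def gauss_jordan_solve(A, b):
        A = A.astype(float).copy()
        x = b.astype(float).copy()
        n = A.shape[0]
        for k in range(n):
            inv_pivot = 1.0 / A[k, k]      # invert the pivot 'block'
            A[k, k:] *= inv_pivot          # normalise the pivot row
            x[k] *= inv_pivot
            for i in range(n):             # update all other rows
                if i != k:
                    f = A[i, k]
                    A[i, k:] -= f * A[k, k:]
                    x[i] -= f * x[k]
        return x

    A = np.array([[4.0, 1.0], [2.0, 3.0]])
    b = np.array([1.0, 2.0])
    print(gauss_jordan_solve(A, b))        # [0.1, 0.6]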
The time to perform a Gaussian decomposition such as Gauss-Jordan elimination can be
estimated from the time to complete all the jobs in the serial task list. The jobs within the task
list may be done in parallel, but at several points throughout the procedure, all the processors
must wait on a single processor. Such a case in Gauss-Jordan elimination is when the processors
in the node in which the pivot row is held must wait for the inverse of the pivot block to be
calculated before the pivot row can be normalised. Additionally, all other processing nodes must
wait on the pivot row to be normalised before the other rows can be updated. Therefore, the
inherent serial nature of the algorithm constrains the maximum parallel efficiency and hence
the performance speed up.

The time estimate is based on the serial procedure:

1. Invert the pivot block A_kk
All processors except the one performing the inverse are idle at this stage.

2. Normalize the pivot row
All processors within a node can be active for this operation, as the row can be distributed
throughout the node. All processors at other nodes are idle.

3. Broadcast the normalized pivot row
This is a communication phase, the time for completion of which is dependent on the
broadcast parameters in Equation 8.1. The broadcast need not be included in its entirety,
as the broadcast can commence as soon as the first pivot row element has been calculated,
while the update phase can start as soon as the pivot row has been received at a node.
However, for the sake of argument, the inclusion of the broadcast time in the serial
procedure will give a minimum bound on the expected performance, and hence a more exact
calculation can only get better.

4. Update all blocks in all other rows at each node
This part is the most parallel part of the algorithm, as all processors are in use all the
time. This is the main advantage of Gauss-Jordan elimination for a parallel architecture
over other matrix reduction schemes, as there will always be a constant number of rows
being operated on. In fact, work at Caltech has shown that Gauss-Jordan elimination,
traditionally considered the least efficient algorithm for matrix solution, can solve a system
faster than many other algorithms because of its very good processor usage properties [36].
If there is only a single row in a node, then the row can be divided among the processing
arrays at the node, and all processors operate on the same row partition. If the number
of row partitions at a node is a multiple of the number of processing arrays in a node
(nominally three), then each processing array can exclusively operate on a row partition.

5. Repeat the procedure above until all columns are reduced.

The time to invert a block stored in main memory will be the same as Equation 7.8. The
following normalisation phase will take one third of the number of column partitions in a row
partition multiplied by the time for a single block size multiply. The update phase will take the
time for the node with the largest number of row partitions (there could be several with
the same number) to update its rows. The update time is twice the multiply-accumulate time
for a block sized multiply, multiplied by the number of blocks stored at a node, divided by three.
The factor of two is present because the original block must be loaded into the array before it is
updated. The factor of three is due to there being three processing arrays at a node. The total
time is the time for all iterations on a decreasing sized matrix.
Because the matrix size is reducing in one dimension only (the number of rows operated on
per column is always constant; only the number of columns reduces), the decreasing matrix size
can be replaced, for the sake of time calculations, by a constant sized matrix with dimension
equal to the average of the decreasing matrix dimensions. Equation 8.8 summarizes the times for a
P × R matrix on a multiprocessor with ξ nodes, each with three processing arrays of size 32 × 32.
The number of row partitions is denoted by ν and the number of column partitions is denoted
by ρ, such that P = 32ν and R = 32ρ.
    T_Par. Gauss-Jordan = (T_pivot + T_normalise + T_broadcast + T_update) × ρ
                        ≈ (T_pivot + (ρ/6) T_block mult + T_broadcast + (ρ/3) ⌈ν/ξ⌉ T_block mult) × ρ    (8.8)

where ρ/2, the average number of column partitions remaining per iteration, has been used in
place of the decreasing column count.
The number of necessary operations is still the same as for a single processing array, ie
ops = n³ − 1.5n² + 0.5n for an n × n matrix (P = R = n in this case). This leads to an
expression for system performance, which is given in Equation 8.9.

    perf_Par. Gauss-Jordan ≈ (n³ − 1.5n² + 0.5n) / T_Par. Gauss-Jordan    (8.9)

Equation 8.9 was evaluated for systems with between four and 128 processing nodes, each
with three processing arrays, with varying internode latencies. The results for internode latencies
of 100ns, 1000ns and 10000ns are shown in Figures 8.7a) to 8.7c) respectively.
8.2 Wavelet Processor

This is a simple description of a systolic array suitable for performing a compact wavelet
transformation. The impetus for this came while investigating applications for the matrix engine, due
to the matrix multiplication nature of a wavelet transform. However, although some forms of
the wavelet transform are suitable to implement on the matrix engine [52], the compact wavelet
transforms deserve a more specialised architecture.

Similarly to the Fourier Transform, a Wavelet Transform translates a vector of components in
one domain (eg time) into an equally sized vector in another domain (eg frequency). Unlike the
Fourier Transform, for which the translation operators are sines and cosines, which are localized
in frequency but infinite in time, the operators for the Wavelet Transform are localised in both
time and frequency.

The translation matrix for the compact Daubechies transform is a cyclic, nearly-band matrix.
The dimension of the matrix must be a power of two. Such a matrix is shown in Figure 8.8
(from [73]) for the case of four coefficients. The coefficients are chosen such that the odd rows
of the transform matrix produce a 'smoothing' effect on the input vector, while the even rows
attempt to zero, or 'decimate', the response [19].
Figure 8.8: Four Coefficient Compact Wavelet Transformation Matrix

The coefficient matrix can be modified to remove the cyclic wrap-around condition, or rather
to help visualise the method of coping with the fact that the last wavelet coefficients are affected
by both ends of the data vector. The new matrix becomes the one shown in Equation 8.10.
According to Press et al. [73], it is possible to eliminate the wrap-around completely, leaving a
purely band-diagonal coefficient matrix without changing the size of the matrix, although they
do not pursue that modification.
One property is that the transforms of the odd rows are stored in the top half of the output
vector and the even rows are stored in the bottom half. Thus the top half contains the 'smooth'
components and the bottom half contains the 'zero'ed components. The 'smooth' components
of the output vector are then recirculated again in a pyramidical fashion, smoothing to half the
components and decimating the other half, until only one 'smooth' component remains. The text
by Press et al. provides a good graphical description of the transform and permute procedure,
similar to Figure 8.9 below.

Of course, there will be no actual 'permute' stage, as this is simply a storage routine. The
address pattern for loading will be

    prev_addr + 1

while the pattern for storing will be

    prev_addr + N/2        for even rows
    prev_addr − N/2 + 1    for odd rows

or, even more simply, two counters can be used, the first starting at one, the second at N/2 + 1.
Figure 8.9: Pyramidical Transform and Permute Procedure
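The transform and permute procedure can be modelled in a few lines. The sketch below
(Python with NumPy) uses the standard Daubechies four-coefficient values, which the text
leaves to Press et al. [73], and the 'two counter' storage pattern described above, so no explicit
permute is performed.

    import math
    import numpy as np

    R2 = 4 * math.sqrt(2)
    C0, C1 = (1 + math.sqrt(3)) / R2, (3 + math.sqrt(3)) / R2
    C2, C3 = (3 - math.sqrt(3)) / R2, (1 - math.sqrt(3)) / R2

    def d4_step(x):
        # One cyclic pass: smooth results fill the top half of the output,
        # decimated results the bottom half (the two store counters).
        n = len(x)
        out = np.empty(n)
        smooth, detail = 0, n // 2
        for i in range(0, n, 2):
            x0, x1 = x[i], x[(i + 1) % n]
            x2, x3 = x[(i + 2) % n], x[(i + 3) % n]
            out[smooth] = C0*x0 + C1*x1 + C2*x2 + C3*x3  # smoothing row
            out[detail] = C3*x0 - C2*x1 + C1*x2 - C0*x3  # decimating row
            smooth += 1
            detail += 1
        return out

    def dwt(x):
        # Pyramid: recirculate the smooth half down to the shortest
        # length the four-coefficient filter supports.
        x = np.asarray(x, dtype=float).copy()
        n = len(x)
        while n >= 4:
            x[:n] = d4_step(x[:n])
            n //= 2
        return x

    print(dwt([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]))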
The regular nature of the algorithm, the use of constant matrix coefficients and the sparsity of
the matrix all make the transform well suited to a systolic implementation. As
the bandwidth of the matrix is always 'small' compared to the matrix size, the dimension
of the array can be reduced by one, ie to a linear array the same length as the number of
coefficients.

If the array is preloaded with the coefficients, then the input vector can move from one end
of the array to the other, with the 'smoothed' part of the output vector moving back to the
input end and the 'decimated' part of the output vector moving in the same direction as the
input data.
Figure 8.10: Cell from a Linear Array for the \Mavelet Transform
A block diagram of the linear array is given in Figure 8.11, showing the array,
the input and output vectors, and the dual ported RAMs used for recycling the smoothed data.
If the matrix coefficients are in ascending order from left to right and the input vector is
applied from the right hand side, the inputs propagate to the left and ultimately all end in the
output vector in the correct form. Because of the ordering of the matrix coefficients in the
systolic array, the 'smooth' components will all be computed simultaneously. Therefore, coefficients
1 and 2 can be added, while all the others are delayed by suitable elements. The accumulated
coefficient products are then shifted right and accumulated with the next coefficient product,
and so on, until the completed 'smooth' component emerges from the right hand edge of the array.

Figure 8.11: Linear Array for n-dimensional Wavelet Transform
The formation of the decimated components also requires delays, but this time because some
products are available before any other products.

Table 8.2: Wavelet Products versus Time

Table 8.2 shows the time-step at which each product is calculated for the first ten time-steps,
with four coefficients. The first set of products that are summed to produce the first 'smooth'
coefficient, s_1, are all calculated in step 4. The required sum is

    s_1 = C_0 X_1 + C_1 X_2 + C_2 X_3 + C_3 X_4

If the data is moved to the 'right', then the temporary sum s_1 = C_0 X_1 + C_1 X_2 can be calculated
immediately. The next temporary sum s_1 = s_1 + C_2 X_3 can be calculated in the next time step,
so the product C_2 X_3 needs one delay slot. The remaining product can be added one time step
later, according to the equation s_1 = s_1 + C_3 X_4, so two delay slots are required. The decimated
components move to the left, in the same direction as the incoming data. The product C_3 X_1
is calculated in the first time step, and must have the product C_2 X_2 subtracted from it. C_2 X_2
is calculated in the third time step, so the result of C_3 X_1 must be delayed by two slots. The
temporary sum d_1 = C_3 X_1 − C_2 X_2 is available after the fourth time step, and must be added to
the third product C_1 X_3, which is available after the fifth time step, to produce the temporary
sum d_1 = d_1 + C_1 X_3. Therefore, the temporary sum d_1 must only be delayed by a single
time slot. Similarly, the final 'decimated' result d_1 = d_1 − C_0 X_4 also requires that the temporary
sum d_1 only be delayed by a single slot. The reason only the first product term requires two
delay slots instead of one is that it is not passed through an accumulator or adder. Therefore, if
all the cells in the array were created exactly alike, and the first product was accumulated with
'zero', then only a single delay slot would be required.

With these requirements in mind, it is a simple matter to produce a systolic cell capable of
performing a compact Daubechies wavelet transform. Such a cell is shown in Figure 8.12. An
arbitrary number of these cells can be butted together to form a linear systolic array capable of
performing a wavelet transform of any degree within the confines of the algorithm (there must
be an even number of coefficients). Note that a binary tree would have been a more logical
approach to implementing the accumulation of all the components that are available
simultaneously; however, the binary tree does not lend itself to a simple, regular systolic structure.

Figure 8.12: Single Cell for Wavelet Implementation
To 'undo' the wavelet transform and reproduce the original input vector x̄, the vector
of wavelet values, denoted w̄, needs to be multiplied by the inverse of the wavelet coefficient
matrix C, ie

    x̄ = C⁻¹ w̄

Recalling that the compact Daubechies wavelet transform is a transformation on an orthonormal
basis, ie all the rows and columns in C are orthogonal, the inverse matrix C⁻¹ is simply
the transpose of the same matrix, C⁻¹ = Cᵀ. As the transposed matrix is nearly identical to
the original matrix in form, exactly the same hardware can be used for the inverse transform as
for the forward transform.
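The orthonormality claim is easily checked numerically. The sketch below (Python with
NumPy) builds the cyclic four-coefficient matrix in the interleaved form described for Figure 8.8
(the structure is assumed from Press et al. [73]) and verifies that its inverse is its transpose.

    import math
    import numpy as np

    R2 = 4 * math.sqrt(2)
    C0, C1 = (1 + math.sqrt(3)) / R2, (3 + math.sqrt(3)) / R2
    C2, C3 = (3 - math.sqrt(3)) / R2, (1 - math.sqrt(3)) / R2

    def d4_matrix(n):
        # Row pairs (C0 C1 C2 C3) and (C3 -C2 C1 -C0), stepping two
        # columns per pair, with cyclic wrap-around at the bottom.
        C = np.zeros((n, n))
        for k in range(n // 2):
            cols = [(2 * k + j) % n for j in range(4)]
            C[2 * k, cols] = [C0, C1, C2, C3]
            C[2 * k + 1, cols] = [C3, -C2, C1, -C0]
        return C

    C = d4_matrix(8)
    print(np.allclose(C @ C.T, np.eye(8)))       # True: orthonormal rows
    print(np.allclose(np.linalg.inv(C), C.T))    # inverse == transpose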
8.2.1 Still Picture Compression

Because the Wavelet Transform produces large coefficients for regions of large change, or high
contrast, and small coefficients for regions of small change, or low contrast, compression by
maintaining only the large components will result in minimal noticeable loss in image quality.
Additionally, as the Wavelet Transform is localised in both space and time, the effects
of omitting a coefficient are localised as well. This is different to omitting Fourier or cosine
coefficients, which are not localised, and hence produce 'ringing' when components are omitted.
For this reason, Fourier based image compression generally uses 'blocking', whereby the image
is divided into many small blocks (typically 6 × 6 to 16 × 16 pixels) and the lossy compression
is contained within the blocks. This in itself is another cause of poor image compression, as the
blocks themselves become noticeable for large compression ratios.
8.2.2 Moving Picture Compression

The MPEG standard (Moving Pictures Experts Group) includes both inter- and intra-frame
compression. These compressions are between successive images and within an image
respectively. Therefore, the amount of computational power required is much greater than for single
image compression. However, as the images will not, in general, be analysed as critically as a
still image, the compression ratios can be much higher, considerably reducing bandwidth [2].

Similarly, by comparing the difference between successive compressed wavelet transformed
images, a system can be constructed that uses wavelet compression on moving pictures with a
highly reduced required bandwidth and a less noticeable loss of quality.

Variable Quality Image Compression

As the number of coefficients used for the Wavelet Transform is an arbitrary even number and
the systolic array for processing the transform is linear, the number of coefficients is dynamically
reconfigurable. Additionally, the threshold of compression can be changed between successive
images. Therefore, it is relatively simple to create a system that transmits a reduced quality
image for the majority of its use, while being capable of transmitting high quality images with
an increased delay if required. Such a system would be of great use in applications such as video
telephones and mobile video communication, where the bandwidth is limited. The standard
image of reduced quality could be tailored to use the full available bandwidth for real time
motion, while a high quality image (such as a photograph or detailed diagram) could be sent
with a delay.

This section on a systolic wavelet processor is intended to demonstrate the ease with which
many algorithms can be converted into systolic arrays, not to cover the image compression and
coding itself. For more information on compression, see [73, 52, 53] and other authors.
8.3 Conclusion
The abilìty to parallelize nrany matrix opera,tions has been used to show that a multiprocessor
version of the matrix engine is possible, with estimated performance figures in the region of
tens to hunrlreds of GigaFLOPS for Gauss-Jorclan elimination anrl approaching a TeIaFLOP for
matrix multiplication on an appropriately sizecl machine for a large enough problem size. These
flgures operÌ a whole realm of modelling in fields such as airflow and fluid-flow that was previously
infeasible, such as the a,utomated optimisa,tion of aìrcra,ft, vehic.le or ship profiles involving the
repeated solution of matrices of the orclel of 20,000 x 20,000 to 1,000,000 x 1,000,000.
169
The systolic process was also extended to the design of a linear systolic array for the com-
putation of the Daubechies Discrete Wavelet Transform (DWT). This transform shows great
potential for use in lossy image compression, and as such a systolic array capable of perform the
computations simply and with a high bandwidtli would be of great use. Applications such as
pay-TV (cable-TV) and video conferencing would beneflt greatly by a reduction in the amount
of data that is required to be transmitted with minimal loss in quality, and that can be modified




In this thesis, the implementation of a systolic processor interface has been introduced and
solutions proposed for the many difficulties encountered in such a task. The design task has
included several separate but related tasks, including

1. The VLSI design of an extended version of Marwood's Address Generator.

2. The construction and development of algorithms to be implemented on the matrix engine.

3. A discussion of arithmetic components used to optimise the speed of the system.

4. The design of a memory subsystem to support the high computational rate of the processing
array.
For comparison purposes, consider the SCAP processor particulars that were presented in
Section 2.2. Although the SCAP system has now been updated and is using a slower technology
than that used here, a point by point comparison is still interesting. This is shown in itemised
form below.

• Not only does the data controller presented here contain an interface to memory from the
array, it includes cache management and tag checking in the interface standard, which
allows the construction of fast multi-level memory systems at a reasonable cost.

• The data controllers are cascadable to an arbitrary size. In fact, if a multichip solution
is used, with the address generation and tag checking hardware on one chip and the data
registers and sundry arithmetic on another, only the data chip need be replicated for
expansion, as only one address generator per interface is required.

• The bus interface will allow common or independent memory subsystems. However, unlike
the SCAP system, a common memory system must allow some form of bank switching
or multi-access memory, as the interface requires that each controller can access memory
independently.

• As the address generator is a Marwood Difference Engine that has been extended to four
dimensions, the same zero cost operations available on SCAP are also available, as well as
zero cost partitioning and numerical algorithms.

• There is no direct support for complex matrices, although these can be supported in
software. Sub-matrices and non-square matrices are supported via the difference engine.

• Arbitrary sized matrices are directly supported. The four dimensional difference engine
will allow the data controller to automatically partition any applied matrix to fit on the
fixed sized processing array.

• The data controller can be used to either load or store results.

• Allocating a small amount of instruction memory in the Harvard architecture style will
allow the data controllers to execute their own instruction streams independently from
a host system, allowing complicated algorithms to be performed in their entirety by the
matrix processing system.

• Each data controller can fetch or store operands at a rate of 100M floating point words
per second, memory subsystem permitting.
A block diagram of the complete system is shown in Figure 9.1.

Figure 9.1: Complete System Block Diagram

The system is capable of speeds currently available only using expensive supercomputer
technology, but instead requiring only CMOS technology and a much smaller physical area. A
Multi-Chip Module containing an array capable of three to five sustained GigaFLOPs would be
hand-holdable, and a system built around the array could fit on a single printed circuit board.
Because of the small size of each processing array, multiprocessors constructed with of the order of
one hundred processing arrays are viable, which puts the realm of TeraFLOP computing within
grasp. The possibility of connecting 100 CRAY-2 supercomputers together is minimal, even in
terms of the physical area that would be consumed, whereas one hundred systolic processing
arrays connected together is not infeasible.

Although the range of applications that can be solved on the processing array is limited,
suitable applications typically form the basis for computations that consume a great deal of
available computing power with little user interaction, for example, large air flow and fluid
flow simulations.
Pages 174 to 178
Removed
for reasons of commercial confidentiality
Appendix B

Proof of Convergence for Iterative Matrix Inversion

B.1 Initialising the Pivot Inverse Iteration
The residual matrix R is defined as the difference between the actual inverse and the initial
approximate inverse. If it is noted that the product of the actual inverse and the
matrix for which the inverse is sought (i.e. A) is the identity matrix I, then the residual matrix is
the difference between the identity I and the product of Y_0 and A. This is shown in Equation
B.1.

R = I - Y_0 A    (B.1)
This leads to the (infinite) series

A^{-1} = (I + R + R^2 + R^3 + \cdots) Y_0    (B.2)
From Equation B.2, for convergence, the norm of R must satisfy

\|R\| < 1    (B.3)

where the norm is defined to be the largest amplification of length that the matrix is able to
induce on a vector [73], i.e.

\|R\| = \max_{v \neq 0} \frac{|R v|}{|v|}    (B.4)
Press et al. [73] provide a good analysis of the work of Pan & Reif [71] for choosing a good
initial guess. They point out that a suitable choice is a sufficiently small constant, ε, multiplied
by the transpose of the matrix for which the inverse is desired, i.e.

Y_0 = \epsilon A^T    (B.5)
If Equation B.5 is to satisfy the requirement from Equation B.3, then it can be shown [73]
that R is of the form

R = \mathrm{diag}(1 - \epsilon\lambda_1, 1 - \epsilon\lambda_2, \ldots, 1 - \epsilon\lambda_N)    (B.6)

where \lambda_i are the eigenvalues of A^T A.
Pan & Reif [71] point out that the vect,or norm rec¡rirernent in Ec¡ration 8.4 neecl not be





,<\oln or €<+ (B.z)
j,k 
' - rnax; Ðil"¿tl x maxi Ð¿lo¿¡l
which are much easier to calculate than the eigenvalues.
A further simplification can be made by choosing a different matrix norm from Equation B.4.
Convergence is also guaranteed by, for example, the Frobenius (or Euclidean) norm

\|R\|_F = \left( \sum_{i,j} r_{ij}^2 \right)^{1/2}    (B.8)

the "column-sum" norm

\|R\|_1 = \max_j \sum_i |r_{ij}|    (B.9)

and the "row-sum" norm

\|R\|_\infty = \max_i \sum_j |r_{ij}|    (B.10)
any of which may be satisfied to guarantee convergence.
Recalling that Gaussian elimination¹ is stable only for a diagonally dominant matrix
(|a_{ii}| > \sum_{j \neq i} |a_{ij}|), the row-sum norm of the residue matrix can be made less than one if the initial
estimate of the inverse (Y_0) is set to

Y_0 = \mathrm{diag}\left( \frac{1}{a_{11}}, \frac{1}{a_{22}}, \ldots, \frac{1}{a_{NN}} \right)    (B.11)












































If the row-sum norm from Equation B.10 is used, then the norm of the residue matrix R
becomes

\|R\|_\infty = \max_i \sum_{j=1}^{n} |r_{ij}| = \max_i \sum_{j \neq i} \frac{|a_{ij}|}{|a_{ii}|} < 1

for a diagonally dominant matrix, since r_{ii} = 0 and r_{ij} = -a_{ij}/a_{ii} for this initial estimate.
Therefore, an initial estimate of the form of Equation B.11 will converge to the inverse matrix as n → ∞².
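The scheme can be summarised with a minimal C sketch; the helper names and the use of plain double-precision arrays are assumptions for illustration, and on the processing array each of the two matrix products per pass would be performed as an outer-product computation. Note that the update Y ← (I + R)·Y squares the residual on every pass, since the new residual is I - (I + R)YA = I - (I + R)(I - R) = R², which is the quadratic behaviour referred to in the footnote below.

#include <math.h>
#include <stdlib.h>

/* c = a.b for n x n row-major matrices. */
static void matmul(int n, const double *a, const double *b, double *c)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += a[i*n + k] * b[k*n + j];
            c[i*n + j] = s;
        }
}

/* Row-sum norm of Equation B.10: max_i sum_j |m_ij|. */
static double rowsum_norm(int n, const double *m)
{
    double worst = 0.0;
    for (int i = 0; i < n; i++) {
        double s = 0.0;
        for (int j = 0; j < n; j++)
            s += fabs(m[i*n + j]);
        if (s > worst)
            worst = s;
    }
    return worst;
}

/* Iteratively refine Y towards A^-1, starting from the diagonal estimate
 * of Equation B.11 (A is assumed diagonally dominant).  Returns the final
 * ||I - Y.A||inf, which must stay below one for convergence (B.3). */
static double invert(int n, const double *a, double *y, int iters)
{
    double *r  = malloc(sizeof(double) * n * n);
    double *ya = malloc(sizeof(double) * n * n);
    double *ry = malloc(sizeof(double) * n * n);
    double norm = 0.0;

    for (int i = 0; i < n; i++)                /* Y0 from Equation B.11 */
        for (int j = 0; j < n; j++)
            y[i*n + j] = (i == j) ? 1.0 / a[i*n + i] : 0.0;

    while (iters--) {
        matmul(n, y, a, ya);                   /* R = I - Y.A           */
        for (int i = 0; i < n*n; i++)
            r[i] = -ya[i];
        for (int i = 0; i < n; i++)
            r[i*n + i] += 1.0;
        norm = rowsum_norm(n, r);              /* convergence monitor   */
        matmul(n, r, y, ry);                   /* Y = (I + R).Y         */
        for (int i = 0; i < n*n; i++)
            y[i] += ry[i];
    }
    free(r); free(ya); free(ry);
    return norm;
}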









²Actually, we hope much sooner than n → ∞. The convergence should take approximately log₂ prec iterations,
where prec is the precision in bits of the scalars in the matrix, due to the quadratic nature of the algorithm.
App"ndix C
Broadcasting Data on a Hypercube
One of the major bottlenecks with multiprocessor systems is the accessing of data residing in
main memory. If the memory is physically distributed throughout the system, data needs to be
sent among the nodes by message passing strategies. Two nearest neighbours can communicate
directly, but two remote processors can only exchange data by routing a message through inter-
mediate nodes. Although this appears to be detrimental to the performance of the system by
interrupting the intermediate node(s), advantage can be made of this mechanism if a broadcast
is required, as the intermediate node can not only pass the message on, but keep a copy for itself.
The broadcast mechanism is very useful in matrix algorithms such as matrix-matrix multiplica-
tion and Gaussian elimination, where complete rows or columns are required simultaneously at
all nodes.
The theoretical limit for the performance of a broadcast on a hypercube with 2^m nodes can
be obtained by realising that it takes at least m(β + τ_c) time units for the first word to reach
the last node (the furthest along the critical path), where β is the latency per word and τ_c is the
time to send one word over a nearest-neighbour link. This word can arrive simultaneously with
(m - 1) others (one per link). After the first word arrives, the last node can receive at most m
words per cycle for the remaining L/m cycles, where L is the number of words that are to be
broadcast. Therefore, if the number of packets of length (L/m) is greater than the critical path.









C.1 Simple Broadcast

If a broadcast function on a hypercube were to simply broadcast from each node to its nearest
neighbours, which in turn would broadcast to each of their nearest neighbours, a large amount of
redundant data would be transmitted and the control would be difficult. In a system where the
communication bandwidth is a critical factor, such a scheme would result in a serious reduction
in sustained performance. A much better way to broadcast data in an m-cube is to generate
a spanning tree with 2^m - 1 links, as shown in Figure C.1 for a broadcast from node 0. The tree is
embedded in the topology, so the transfer of data from node 0 flows along the heavier lines in
Figure C.1.





Figure C.1: Spanning Tree for a 4 Dimensional Hypercube
The spanning tree uses the same Gray-code that is used in the labelling of the m-cube,
by receiving on the link that corresponds to the first bit becoming a '1' from a sending node
that has the same bit equal to '0'. As the critical path is 'm' links long, the broadcast takes m
broadcast time units, with communication over the longest length started first. For a broadcast
initiated at node 0, then

1. send data from node 0 to node 8 (first 0-1 at bit position 3)
2. send data from node 0 to node 4, and from node 8 to node 12 (first 0-1 at bit position 2)
3. send data from node 0 to node 2, from node 4 to node 6, from node 8 to node 10, and
from node 12 to node 14 (first 0-1 at bit position 1)
4. send data from node 0 to node 1, from node 2 to node 3, from node 4 to node 5, from
node 6 to node 7, from node 8 to node 9, from node 10 to node 11, from node 12 to node
13, and from node 14 to node 15 (first 0-1 at bit position 0)
Of course, if the broadcast were to originate from any other node, a bit-wise XOR between
the absolute node label and the label of the originating node will produce the correct broadcast
table (the sketch below enumerates this schedule). It is obvious from the above procedure and
the corresponding spanning tree that, if the time to transfer 'L' words between two nodes is
(β + Lτ_c), where β is the start-up overhead and τ_c is the time to pass one word between nodes,
the total broadcast time is given by

T_{SB} = m(\beta + L\tau_c)
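For illustration, the schedule above, together with the XOR relabelling for an arbitrary originating node, can be enumerated by the following C sketch (a software model only, with node numbering as in Figure C.1):

#include <stdio.h>

/* Enumerate the simple spanning-tree broadcast on an m-cube.  At the step
 * for bit position b, every node whose label relative to the root is zero
 * in bits b..0 forwards the data across link b. */
static void simple_broadcast(int m, int root)
{
    for (int b = m - 1; b >= 0; b--) {
        printf("step %d (bit position %d):\n", m - b, b);
        for (int d = 0; d < (1 << m); d += 1 << (b + 1)) {
            int src = d ^ root;                /* relabel for this root */
            int dst = (d | (1 << b)) ^ root;
            printf("  node %2d -> node %2d\n", src, dst);
        }
    }
}

int main(void)
{
    simple_broadcast(4, 0);    /* reproduces the four steps listed above */
    return 0;
}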
C.2 Pipelined Broadcast
A method for improving the throughput of data is to pipeline the data flow [76]. The theory of
pipelining is straightforward, with packets of data progressing in a pipelined manner through
a single spanning tree. Hence, if a node at distance 'j' from the root node has the first packet
of data, then packet 2 will be at a distance 'j-1' from the root node, while the j'th packet is at the
neighbouring nodes of the root node. If the broadcast of length 'L' is broken into 'μ' packets
for pipelining, each of size 'L/μ', then the total time for a broadcast is simply the time for the
first packet to reach the last node plus the time for the remaining packets to propagate through.
Using the simple broadcast times, the time for the first packet of length 'L/μ' to reach the last
node is m(β + τ_c L/μ). The remaining packets reach the last node every (β + τ_c L/μ) units, and
there are (μ - 1) such packets, so the remaining packets take (μ - 1)(β + τ_c L/μ) units. Therefore,
the total time is

T(\mu) = (m + \mu - 1)\left(\beta + \frac{\tau_c L}{\mu}\right)
The only variable that can be manipulated is the packet-size μ, as m, τ_c and β are machine
dependent and L is problem dependent. Therefore, to minimize the time T, separate the
components with and without μ and then minimize with respect to μ:

T_m(\mu) = \left( (m-1)\beta + \tau_c L \right) + \left( \mu\beta + \frac{(m-1)\tau_c L}{\mu} \right)    (C.2)

As Equation C.2 is hyperbolic, an examination of the μ end-points (μ = 0 and μ → ∞) shows that
a single minimum lies between them, where the two μ-dependent terms balance:

\mu\beta = \frac{(m-1)\tau_c L}{\mu} \quad \Rightarrow \quad \mu_{opt} = \sqrt{\frac{(m-1)\tau_c L}{\beta}}

Substituting μ_opt back into Equation C.2 gives the minimum broadcast time

T_m(\mu_{opt}) = (m-1)\beta + \tau_c L + 2\sqrt{(m-1)\beta \tau_c L} = \left( \sqrt{(m-1)\beta} + \sqrt{\tau_c L} \right)^2
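As a sketch, the optimum packet count and the corresponding broadcast time can be evaluated directly from these expressions (the function names are illustrative, and in practice μ must be rounded to an integral number of packets):

#include <math.h>

/* Pipelined broadcast time T(mu) = (m + mu - 1)(beta + tau*L/mu). */
static double pipelined_time(int m, double L, double beta, double tau, double mu)
{
    return (m + mu - 1.0) * (beta + tau * L / mu);
}

/* Optimum packet count, from balancing mu*beta = (m-1)*tau*L/mu. */
static double optimal_mu(int m, double L, double beta, double tau)
{
    return sqrt((m - 1.0) * tau * L / beta);
}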





The procedure above assumes that the node-to-node links can all be activated simultaneously,
and that they are unidirectional. If the links are bidirectional, then only half the possible
bandwidth will be used, so the modification of Section C.4 should be used.
C.3 Parallel or Rotated Broadcast
The spanning tree in Figure C.1 only uses 2^m - 1 of the available m 2^{m-1} links, which is a
considerable waste of resources. To utilise the resources in a more efficient manner, 'm' separate
spanning trees can be created. If these are rotated into the m dimensions and the data is
split into 'm' pieces, 'm' concurrent broadcasts can proceed, making use of all the links of the
hypercube. Figure C.2 shows four spanning trees for a 4-cube.
If 'wormhole' routing is to be used (packets of data 'burrow' through the network in a
wormhole fashion), the act of splitting the data into m pieces is a natural requirement, and can
be used to our benefit. For a broadcast of 'L' words, the time to broadcast using the rotating
broadcast scheme is the same as the time to broadcast L/m words using the first (simple)
scheme, as the initial data transfer has been split into m packets of the same length. Therefore,
the total time to transfer 'L' words is

T_R = m\left(\beta + \frac{\tau_c L}{m}\right) = m\beta + L\tau_c
Note that the packet-size independent term Lτ_c is not dependent on the dimension of the
hypercube, due to the use of all links all the time. The only implementation-size dependent
term is the latency term mβ.
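The rotation can be expressed as a relabelling of the simple schedule: piece k follows the Section C.1 schedule with every node label rotated left by k, so that at any given step the m trees drive links in m distinct dimensions and cannot contend. The following C sketch (illustrative only) enumerates the concurrent transfers:

#include <stdio.h>

/* Rotate an m-bit node label left by k positions. */
static int rotl(int x, int k, int m)
{
    k %= m;
    return k ? ((x << k) | (x >> (m - k))) & ((1 << m) - 1) : x;
}

/* Enumerate the m concurrent broadcasts of the rotated scheme. */
static void rotated_broadcast(int m, int root)
{
    for (int s = 0; s < m; s++) {              /* broadcast step         */
        int b = m - 1 - s;                     /* base bit position      */
        printf("step %d:\n", s + 1);
        for (int k = 0; k < m; k++)            /* tree k carries piece k */
            for (int d = 0; d < (1 << m); d += 1 << (b + 1)) {
                int src = rotl(d, k, m) ^ root;
                int dst = rotl(d | (1 << b), k, m) ^ root;
                printf("  piece %d: node %2d -> node %2d (dimension %d)\n",
                       k, src, dst, (b + k) % m);
            }
    }
}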





C.4 Pipelined and Rotated Broadcast
To pipeline a rotated broadcast strategy¹ is not a simple matter of combining the two approaches,
as the spanning trees (eg of Figure C.2) are not edge-disjoint. This is due to the fact
that a pipelined broadcast utilises all links in a spanning tree at an intermediate time (not a
start or finish time), which will cause contention over links if another spanning tree is used
simultaneously. A new tree called the m-Edge disjoint Spanning Binomial Tree (m-ESBT) was
used by Johnsson and Ho [42] to fully utilize the links in a hypercube. They constructed an
m-ESBT as follows [42, 76]:

1. start with a standard Spanning Binomial Tree (SBT), rooted at node 0.
2. rotate the binary node number left to obtain (m - 1) new SBT's².
3. to make these SBT's edge-disjoint, negate the (k-1)th bit of each node on the kth spanning
tree for 1 ≤ k ≤ m - 1, and also negate the most significant bit of the original SBT.
4. node 0 is now the root for all the 'm' resulting spanning trees, so the trees can be merged
into a single tree of height m + 1. This tree is the required m-ESBT, and is shown for a
3-cube in Figure C.3.

Using the expression for the time to broadcast for a pipelined broadcast, the new time to
broadcast data using an m-ESBT can be determined by simply replacing the length L with the
value L/m (the new length sent via each SBT), and the critical path by the new tree height m + 1.
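A literal reading of steps 2 and 3 gives the label transform sketched below (illustrative only, reusing rotl from the previous fragment; see [42] for the full construction and the proof of edge-disjointness):

/* Map a node label of the base SBT onto the k'th constituent tree of the
 * m-ESBT: k left rotations (step 2), then negate bit (k-1), or the most
 * significant bit for the original tree k = 0 (step 3). */
static int esbt_label(int label, int k, int m)
{
    int rotated = rotl(label, k, m);
    int flip = (k == 0) ? (m - 1) : (k - 1);
    return rotated ^ (1 << flip);
}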




¹Or rotate a pipelined one.
²If a node in the cube is represented by the binary number b = (b_{m-1}, b_{m-2}, \ldots, b_1, b_0), the left rotation
operator R_l(b) is defined to be R_l(b) = (b_{m-2}, b_{m-3}, \ldots, b_0, b_{m-1}), while 'k' rotations is represented by the
operator R_l^k(b).
Figure C.2: 4 Rotated Spanning Trees for 4-Dimensional Hypercube
Figure C.3: The m-ESBT for a 3-cube
[1] Anderson E., Bai Z., Bischof C., Demmel J., Dongarra J., Du Croz J., Greenbaum A., Hammarling
S., McKenney A., Ostrouchov S. & Sorensen D.: "LAPACK User's Guide." SIAM Press, Philadelphia,
1992.
[2] Ang P.H., Ruetz P.A. & Auld D.: "Video Compression Makes Big Gains." IEEE Spectrum, Vol 28,
No 10, pp 16-19, October 1991.
[3] Avizienis A.: "Signed-Digit Number Representations for Fast Parallel Arithmetic." IRE Transactions
on Electronic Computers, pp 389-400, September 1961.
[4] Balakrishnan W. & Burgess N.: "Very-high-speed VLSI 2s-complement multiplier using signed
binary digits." IEE Proceedings-E, Vol 139, No 1, pp 29-34, January 1992.
[5] Beaumont-Smith A., Marwood W. & Eshraghian K.: "The Gallium Arsenide Implementation of a
Systolic Floating Point Element." Proceedings of the 12th IREE Australian Microelectronics Confer-
ence, pp 255-260, October 1993.
[6] Berry M., Gallivan K., Harrod W., Jalby W., Lo S., Meier U., Philippe B. & Sameh A.H.: "Parallel
Algorithms on the Cedar System." in Proceedings of the International Conference on Parallel Pro-
cessing and Applications, pp 25-39, 23-25 September 1987. Published by Elsevier Science Publishing
Co.
[7] Blackley W.S., Lim C.S. & Eshraghian K.: "Architecture for Very High Speed Gallium Arsenide Pro-
cessing Elements for Matrix Based Computations." IREE 2nd International Electronics Convention
and Exhibition, 1989.
[8] Bode A. & Dal Cin M. (eds.): "Parallel Computer Architectures." Lecture Notes in Computer Science No.
732. Published by Springer-Verlag, 1993.
[9] Bradley D. & Larson J.: "Fine-grain Measurements of Loop Performance on the Cray Y-MP." CSRD
Report, Center for Supercomputing Research and Development, University of Illinois at Urbana-
Champaign, 1991.
[10] Briggs W.S. & Matula D.W.: "A 17 x 69 Bit Multiply and Add Unit with Redundant Binary
Feedback and Single Cycle Latency." Proceedings of the IEEE Eleventh Symposium on Computer
Arithmetic, pp 163-170, June 29-July 2, 1993, Windsor, Ontario.
[11] Burgess N.: Lecture Notes, Department of Electrical & Electronic Engineering, The University of
Adelaide, 1994.
[12] Burrus C.S.: "Index Mappings for Multidimensional Formulation of the DFT and Convolution."
IEEE Transactions on ASSP, Vol 25, pp 239-242, June 1977.
[13] Bursky D.: "Synchronous DRAMs Clock at 100 MHz." Electronic Design, pp 45-49, February 18,
1993.
[14] Campbell J.K., Synnott S.P. & Bierman G.J.: "Voyager Orbit Determination at Jupiter." in Kalman
Filtering: Theory and Application, IEEE Press 1985, reprinted from IEEE Trans. Automat. Contr.,
Vol AC-28, pp 256-268, Mar. 1983.
[15] Carey G.F. (editor): "Parallel Supercomputing: Methods, Algorithms and Applications." John Wiley
& Sons, 1989.
[16] Carnevali P., Radicati G., Robert Y. & Sguazzero P.: "Block algorithms for Gauss elimination and
Householder reduction on the IBM 3090 Vector Multiprocessor." in Proceedings of the International
Conference on Parallel Processing and Applications, pp 297-302, 23-25 September 1987. Published
by Elsevier Science Publishing Co.
[17] Catanzaro B.: "SPARC MBus Overview." Sun Microsystems, Inc.
[18] Curtis I.A., Clarke R.J., Clarke A.P. & Marwood W.: "Data Formatter." PCT Patent Application,
November 1992.
[19] Daubechies I.: "Orthonormal Bases of Compactly Supported Wavelets." Communications on Pure
and Applied Mathematics, Vol 41, pp 909-996, 1988.
[20] "DECchip 21064-AA RISC Microprocessor Preliminary Data Sheet." Digital Equipment Cor-
poration, Maynard, Massachusetts, 1992.
[21] "Introduction to Designing a System with the DECchip 21064 Microprocessor." Revision 1.0,
Digital Equipment Corporation, Maynard, Massachusetts, 1992.
[22] Dolecek Q.E.: "Parallel Processing Systems for VHSIC." VHSIC Applications Workshop, 1984.
[23] Dongarra J.J., Bunch J.R., Moler C.B. & Stewart G.W.: "LINPACK Users' Guide." SIAM Press,
1979.
[24] Dongarra J.J. (ed.): "Experimental Parallel Computing Architectures." Elsevier Science Publishers,
New York, USA, 1987.
[25] Edelman A.: "Large Dense Numerical Linear Algebra in 1993 - The Parallel Computing Influence."
Internal Report, Department of Mathematics, University of California, Berkeley, California, 1992.
[26] Fisher A.L., Kung H.T., Monier L.M., Walker H. & Dohi Y.: "Design of the PSC: A Programmable
Systolic Chip." Proceedings of the Third Caltech Conference on Very Large Scale Integration,
Pasadena, California, USA, pp 287-302, 21-23 March 1983.
[27] Foulser D. & Schreiber R.: "The Saxpy Matrix-1: A General-Purpose Systolic Computer." Computer,
Vol 20, No 7, pp 35-43, July 1987.
[28] Fox G., Hey A.J.G. & Otto S.: "Matrix Algorithms on the Hypercube I: Matrix Multiplication."
Parallel Computing, Vol 4, pp 17-31, 1987.
[29] Fox G., Johnson M., Lyzenga G., Otto S., Salmon J. & Walker D.: "Solving Problems on Concurrent
Processors, Volume 1." Prentice Hall, 1988.
[30] Gallivan K., Jalby W., Meier U. & Sameh A.H.: "Impact of Hierarchical Memory Systems on
Linear Algebra Algorithm Design." International Journal of Supercomputer Applications, Vol 2, No
1, pp 12-48, 1988.
[31] Gaston F.M.F. & Irwin G.W.: "Systolic Kalman filtering: an overview." IEE Proceedings, Vol 137,
No 4D, pp 235-244, July 1990.
[32] Goldberg D.: "Computer Arithmetic." in "Computer Architecture - A Quantitative Approach.",
Appendix A. Morgan-Kaufmann Publishers, Inc., San Mateo, Cal., 1990.
[33] Good I.J.: "The Relationship Between Two Fast Fourier Transforms." IEEE Transactions on Com-
puters, Vol C-20, No 3, pp 310-317, March 1971.
[34] Händler W., Haupt D., Jeltsch R., Juling W. & Lange O.: "Conpar." Lecture Notes in Computer
Science, No. 237, Springer-Verlag.
[35] Hennessy J.L. & Patterson D.A.: "Computer Architecture - A Quantitative Approach." Morgan-
Kaufmann Publishers, Inc., San Mateo, Cal., 1990.
[36] Hipes P. & Kuppermann A.: "Gauss-Jordan Matrix Inversion With Pivoting on the Hypercube."
Unpublished Caltech report C3P-347, 1986.
[37] Hockney R.W. & Jesshope C.R.: "Parallel Computers: Architecture, Programming and Algo-
rithms." Hilger, UK, 1981.
[38] Hord R.M.: "Parallel Supercomputing in SIMD Architectures." CRC Press, Florida, USA,
1990.
[39] Houstis E.N., Papatheodorou T.S. & Polychronopoulos C.D.: "Supercomputing." Lecture Notes in
Computer Science, Springer-Verlag.
[40] IDT: "Special Memories and Modules." Published 1992.
[41] Johnson K.T., Hurson A.R. & Shirazi B.: "General Purpose Systolic Arrays." IEEE Computer, Vol
26, No 11, November 1993.
[42] Johnsson S.L. & Ho C.T.: "Optimum Broadcasting and Personalized Communication in Hyper-
cubes." IEEE Transactions on Computers, Vol 38, No 9, pp 1249-1268, 1989.
[43] Kahaner D.K.: "Japan: a competitive assessment." IEEE Spectrum, pp 42-47, Sept. 1992.
[44] Kahaner D.K.: "Supercomputing - the View from Japan." IEEE Micro, pp 67-70, Feb. 1993.
[45] Kalman R.E.: "A New Approach to Linear Filtering and Prediction Problems." Transactions ASME,
Journal of Basic Engineering, Vol 82D, pp 34-45, 1960.
[46] Katona E.: "A lattice model for cellular (systolic) algorithms." Parallel Computing, Vol 3, pp 251-
258, 1986.
[47] Kitagawa K.M.: "An MBus Tutorial." Revision 1.0a, Sun Microsystems, SPARC Technical Market-
ing Division, January 25, 1991.
[48] Kreyszig E.: "Advanced Engineering Mathematics, 6th Edition." John Wiley & Sons, New York,
1988.
[49] Kung S.Y.: "VLSI Array Processors." Prentice Hall Information and System Sciences Series, En-
glewood Cliffs, New Jersey, 1988.
[50] Lenoski D., Laudon J., Gharachorloo K., Weber W.-D., Gupta A., Hennessy J., Horowitz M. & Lam
M.: "The Stanford Dash Multiprocessor." IEEE Computer, Vol 25, No 3, March 1992.
[51] Mace M.E.: "Memory Storage Patterns in Parallel Processing." Kluwer Academic Publishers, 1987.
[52] Mallat S.G.: "Theory for Multiresolution Signal Decomposition using the Wavelet Representation."
IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol 11, pp 674-693, 1989.
[53] Mallat S.G. & Zhong S.: "Compact Image Coding from Edges with Wavelets." Proceedings of the
1991 International Conference on Acoustics, Speech and Signal Processing - ICASSP91, Vol 4, pp
2745-2748, 1991.
[54] Marwood W.: "A Generalised Systolic Ring Serial Floating Point Multiplier." Electronic Letters,
Vol 26, No 11, pp 753-754, 24th May 1990.
[55] Marwood W.: "The Implementation of the Discrete Fourier Transform on a Systolic Configurable
Array Processor." Internal Report GAAS-89-1, Department of Electrical and Electronic Engineering,
The University of Adelaide, 1989.
[56] Marwood W.: "A Number Theory Mapping Generator for Addressing Matrix Structures." Patent
Cooperation Treaty Patent, June 1990.
[57] Marwood W.: Discussions and Correspondence at the University of Adelaide, November 1993.
[58] Marwood W.: "An Integrated Multiprocessor for Matrix Algorithms." Ph.D. Thesis, Department of
Electrical and Electronic Engineering, The University of Adelaide, June 1994.
[59] Marwood W. & Beaumont-Smith A.: "The Implementation of a Generalised Systolic Serial Floating
Point Multiplier." APCCAS '92, IEEE, IREE and IEAust Asia-Pacific Conference on Circuits and
Systems, 8-11 Dec. 1992.
[60] Marwood W. & Clarke A.P.: "A Generic Time-Domain Beamformer Architecture." The Australian
Computer Journal, Vol 20, No 3, August 1988.
[61] Marwood W. & Clarke A.P.: "On Computing Fourier Transforms Using a Matrix Product Machine."
Proc. 7th Australian Microelectronics Conference, Sydney, May 16-18, 1988.
[62] Marwood W. & Clarke A.P.: "Overhead Penalties in Dynamically Reconfigurable Arrays of Pro-
cessing Elements." Journal of Electrical and Electronics Engineering, Australia, June 1987.
[63] Marwood W. & Clarke A.P.: "A Matrix Product Machine and the Fourier Transform." IEE Pro-
ceedings G, Circuits, Devices and Systems, Vol 137, No 4, pp 295-301, August 1990.
[64] Marwood W. & Lim C.C.: "A GaAs Systolic Processor for Implementing a Kalman Filter." Proc.
9th Australian Microelectronics Conference, pp 109-114, July 2-4, 1990.
[65] Marwood W., Shaw T., Liebelt M. & Eshraghian K.: "A Data Controller for a Systolic Outer
Product Engine." Proceedings of the 12th IREE Australian Microelectronics Conference, pp 221-226,
October 1993.
[66] Mead C. & Conway L.: "Introduction to VLSI Systems." Addison-Wesley, Reading, Mass., 1980.
[67] Moore W., McCabe A. & Urquhart R.: "Systolic Arrays." Adam Hilger, 1987.
[68] Motorola: "Memory Device Data DL113, Rev 7." Published 1991, previous edition 1990.
[69] Nwachukwu E.O.: "Address Generation in an Array Processor." IEEE Transactions on Computers,
Vol C-34, No 2, pp 170-173, 1985.
[70] Ortega J.M.: "Introduction to Parallel and Vector Solution of Linear Systems." Plenum Press, New
York, 1989.
[71] Pan V. & Reif J.: "Efficient Parallel Solution of Linear Systems." in Proceedings of the Seventeenth
Annual ACM Symposium on Theory of Computing, Association for Computing Machinery, New
York, pp 143-152, 1985.
[72] Petkov N.: "Systolic Arrays for Matrix I/O Format Conversion." Electronic Letters, Vol 24, No 13,
23 June 1988.
[73] Press W.H., Teukolsky S.A., Vetterling W.T. & Flannery B.P.: "Numerical Recipes in C." Second
Edition, Cambridge University Press, 1992.
[74] Quinn M.J.: "Designing Efficient Algorithms for Parallel Computers." McGraw-Hill, 1987.
[75] Reinhold O. & Marwood W.: "Silicon Hybrids - A Technique for 'Zero Defect' Wafer-Scale Proces-
sors." Journal of Electrical and Electronics Engineering, Australia, IEAust. & IREE Aust., Vol 11,
No 3, September 1991.
[76] Robert Y.: "The Impact of Vector and Parallel Architectures on the Gaussian Elimination Algo-
rithm." Manchester University Press/Halstead Press, 1990.
[77] Robbins K.A. & Robbins S.: "The CRAY X-MP/Model 24." Lecture Notes in Computer Science, No.
374, Springer-Verlag.
[78] Saad Y. & Schultz M.H.: "Topological Properties of Hypercubes." IEEE Transactions on Computers,
Vol 37, No 7, pp 867-872, 1988.
[79] Sarkies K.: Personal Communication, Department of Electrical and Electronic Engineering, The
University of Adelaide, 1994. Dr Sarkies is a specialist in high speed communications systems.
[80] Schneck P.B.: "Supercomputer Architecture." Kluwer Academic Publishers, 1987.
[81] Schönauer W.: "Scientific Computing on Vector Computers." Elsevier Science Publishing Co.,
New York, USA, 1987.
[82] Schwider J., Streibl N. & Zürl K.: "Optoelectronic Interconnections." in Parallel Computer Archi-
tectures, Lecture Notes in Computer Science 732. Published by Springer-Verlag, 1993.
[83] Shaw T.J. & Marwood W.: "Gauss-Jordan Elimination on a Systolic Outer Product Engine." Inter-
nal Report HPCA-DC-93/2, Department of Electrical and Electronic Engineering, The University
of Adelaide, 1993.
[84] Shaw T.J. & Marwood W.: "QR Decomposition on a Systolic Outer Product Engine." Internal
Report HPCA-DC-93/3, The Department of Electrical and Electronic Engineering, The University
of Adelaide, 1993.
[85] Shaw T.J. & Marwood W.: "A High Bandwidth, Configurable Memory Interface to a Systolic Array."
Internal Report HPCA-DC-93/4, The Department of Electrical and Electronic Engineering, The
University of Adelaide, 1993.
[86] Shaw T.J. & Marwood W.: "Towards a TeraFLOP: A Parallel Architecture to Support 1000 GFlops
- Sustained." Internal Report HPCA-DC-93/5, The Department of Electrical and Electronic Engi-
neering, The University of Adelaide, 1993.
[87] Snyder L.: "Introduction to the Configurable, Highly Parallel Computer." IEEE Computer, pp 47-56,
January 1982.
[88] Sorenson H.W. (ed.): "Kalman Filtering: Theory and Application." IEEE Press, 1985.
[89] Stone H.S.: "High Performance Computer Architecture, Second Edition." Addison Wesley, 1990.
[90] "SPARC MBus Interface Specification." Revision 1.1, Sun Microsystems, March 29, 1990.
[91] Takagi N., Yasuura H. & Yajima S.: "High-Speed VLSI Multiplication Algorithm with a Redundant
Binary Addition Tree." IEEE Transactions on Computers, Vol C-34, No 9, September 1985.
[92] Waser S. & Flynn M.J.: "Introduction to Arithmetic for Digital Systems Designers." Holt, Rinehart
and Winston CBS College, 1982.
[93] Zorpette G.: "The Power of Parallelism." IEEE Spectrum, pp 28-33, September 1992.
[94] Zorpette G.: "Large Computers." IEEE Spectrum, pp 34-37, January 1993.
[95] Zyner G.B.: "Design of Arithmetic Systems in VLSI." Ph.D. Thesis, The Department of Electrical
and Electronic Engineering, The University of Adelaide, 1988.
