Optimal Distributed Microprocessor Architecture Using Multiphase Processing to Perform a Vector, Matrix Multiplication by Stotts, Larry Gene
OPTIMAL DISTRIBUTED MICROPROCESSOR ARCHITECTURE 
USING MULTI-PHASE PROCESSING TO PERFORM 
A VECTOR, MATRIX MULTIPLICATION 
By 
LARRY GENE STOTTS 
Bachelor of Science 
Oklahoma State University 
Stillwater, Oklahoma 
1972 
Master of Science 
Oklahoma State University 
Stillwater, Oklahoma 
1977 
Submitted to the Faculty of the Graduate College 
of the Oklahoma State University 
in partial fulfillment of the requirements 
for the Degree of 
DOCTOR OF PHILOSOPHY 
July, 1979 
OPTIMAL DISTRIBUTED MICROPROCESSOR ARCHITECTURE 
USING MULTI-PHASE PROCESSING TO PERFORM 
A VECTOR, MATRIX MULTIPLICATION 
Thesis Approved: 
Dean of the Graduate College 
ACKNOWLEDGMENTS
I would like to extend my appreciation to Dr. Edward Shreve, chairman
of my doctoral committee and my thesis adviser, for assistance throughout
the course of my graduate education. I also wish to thank the other
members of my doctoral committee, Dr. Jack Allison, Dr. Craig Sims, and
Dr. Richard Phillips, for their critical review of the manuscript and for
their contributions to my education.
Over the past two years, several persons in addition to the committee
members have given helpful comments and criticisms on the research
described in this thesis. Hopefully, no one has been omitted from the
following list: Dr. Gary Poffenbarger, Mr. Steve Hudson, Dr. Eugene
Bailey.
Finally, I would like to thank my parents for their support and
encouragement through my education in school, sports, and life.
TABLE OF CONTENTS
Chapter
I. PROBLEM DEFINITION
    Introduction
    Problem and Approach
II. SURVEY OF DISTRIBUTED PROCESSOR SYSTEMS
    The Unger Computer
    The Holland Machine
    The Comfort Machine
    The SOLOMON Machine
    The Gonzalez Iterative Circuit
    The ILLIAC IV Computer
    The MCB Machine
    Berkeley Array Processor
    The Cannon Computer
    The General Electric Matrix Processor
    Summary
III. DEFINITION OF DISTRIBUTED ARCHITECTURE AND PROCESSING
    Introduction
    SIMD Architecture
    MIMD Architecture
    Coupling
    Definition of Time and Space Complexity
    Summary
IV. OPTIMIZATION OF DISTRIBUTED ARCHITECTURE
    Introduction
    Array Processing Limitations by Definition
    Design Limits of Array Processor
    Limits on Vector, Matrix Cycle Time
    Hardware Monolithic Multiplier Power and Size Problems
    Technique for Reducing Multiplier Delays
    Multiple Phase Processing
    Optimal Design with Linear Programming
    Optimal Two Phase Design
    Summary
V. PROCESSING ELEMENT DESIGN
    Introduction
    The Design of Processor Bus Structure for Vector, Matrix Products
    Design of Internal Processor Configuration
    Look-Ahead Shift to Increase Floating Point Operations
    Circuit Configuration with Look-Ahead Shift
    Effects of Multiple Phase Operation on the Processing Element Design
    Summary
VI. APPLICATION OF DISTRIBUTED ARCHITECTURE TO LINEAR RECURSIVE FILTER
    Introduction
    Full Order Kalman Filter
    Kalman Filter Realization
    Distributed Architecture Equation Format
    Linear Program Formulation
    Circuit Design of Kalman Filter Problem
    Summary
VII. CONCLUSIONS AND RECOMMENDATIONS
    Conclusions
    Recommendations for Further Research
SELECTED BIBLIOGRAPHY
APPENDIXES
    APPENDIX A - MIXED INTEGER LINEAR PROGRAMMING
    APPENDIX B - LINEAR EQUATION BOUNDARY PLOT PROGRAM
    APPENDIX C - KALMAN FILTER GAIN FINDING PROGRAM
    APPENDIX D - MATRIX COMPUTATIONS OF Q MATRIX
    APPENDIX E - DESIGN STEPS FOR MULTI-PHASE PROCESSOR
LIST OF TABLES
Table
I. Limits on the Size of Algorithm
II. Effects of Tenfold Speed-Up
LIST OF FIGURES
Figure
1. Problem Flow Chart
2. Approach Flow Chart
3. The Unger Computer System
4. The SOLOMON Computer System
5. The ILLIAC IV Computer System
6. The Cannon Computer System
7. The General Electric Processor
8. SIMD Architecture
9. MIMD Architecture
10. Heat Sink Dip for LSI Device
11. Eight Bit Multiply Algorithm
12. Time and Power Equations
13. Time and Area Equations
14. Integer Solutions of Problem
15. Multiply Table
16. Circuit for the Parallel Storage of Data with Least Bit Lines
17. Floating Point Multiplier
18. Floating Point Multiplier with Look-Ahead Unit
19. Processor Circuit with Two Processor Elements
20. Floating Point Multiplier with Interleaving Design
21. System Model Diagrams
22. The Total Signal Generating Model Diagram
23. Plot of the Equations for the Kalman Filter Circuit
24. Integer Solutions to Kalman Filter Circuit
25. Land and Doig Output Data
26. Data Flow for the Kalman Filter
27. Circuit for the Kalman Filter
28. Mixed Integer Programming Logic Diagram
29. Matrix Computations for the Q Matrix
30. Multi-Phase Processor Design Flow Chart
CHAPTER I 
PROBLEM DEFINITION 
Introduction 
Significant alterations in engineering formulations and applications 
have emanated from the evolution of the digital computer. This has insti-
tuted the development of expensive machines of a general purpose structure 
capable of diverse operations. By nature, these instruments are large, 
slow, exorbitant, and complex. In later evolution, computer design
technology produced faster, more efficient systems, but the units remained
large and costly. The need for smaller, faster computers became apparent 
early in the computer age. This need instigated the use of special pur-
pose computers, small in physical size, low in overall cost, and fast in 
execution time. The drawback to such special machines was the initial 
cost and limited application of the system. An example of a limited sys-
tem of this design is the digital differential analyzer. 
Neither large-scale systems nor small special systems appear to be 
suitable for real time applications such as vehicle navigation, signal 
processing, and digital filtering. To meet the requirements necessary 
to compute these algorithms efficiently, an alternative philosophy has 
emerged that utilizes the particular mathematical structure of a given 
class of problems to generate the computing system design. By designing 
the computer to take advantage of the structure of the problem, classes 
of problems possessing similar characteristics may be processed effi-
ciently in a special-purpose machine. 
A group of problems whose mathematical structure can be used to gen-
erate the design criteria of the machine is the class of problems solved 
with vector, matrix operations. This problem structure suggests an
array-oriented machine capable of parallel operation. In the majority of
designs this organization has resulted in an array processor composed of
identical processing elements with an effective interconnecting structure. 
Some examples of this class of problems are recursive linear filters, 
vehicle navigation, phased-array radar control computations, and sonar
receiving array data processing. 
The design and implementation of such a computing apparatus will be 
affected by the recent advances in integrated circuit technology. The
development of large scale integrated circuit technology has prompted the 
fabrication of complex digital processors on a single substrate. The 
current advances in integrated circuits, as well as possible future ad-
vances, must be taken into account in the design of a computer system. In 
the next chapter a survey of array structured systems research, as it 
pertains to the design of special organized computers, is presented. 
Problem and Approach
The designs of special-purpose computer systems for the evaluation 
of vector, matrix products have thus far been constructed using arrays of 
identical processor elements functioning at a common cycle time. These 
machines are composed of N^2 processing units, interconnected by some type
of bus structure and manipulated by a controller. In the event that the
array to be processed is of dimension N^2 or less, maximum throughput is
maintained. However, in situations resulting in an array of dimension
greater than N^2 or less than N^2, the efficiency of machine operation is
reduced due to processor idle time. Processing elements may be added to 
the array to meet larger array requirements or removed to match smaller 
arrays, resulting in the dimensions of the data array and processor array 
being made equal. 
The primary restriction of this architecture is illustrated by plac-
ing a time, power, and size constraint on the system. As the dimension 
of the data array is increased, the design quickly overruns the
constraints. In this situation, if the dimension of the data array is an
integer multiple of the number of processing elements, efficiency can again
be attained. This is true only if the processor operation time is fast
enough to perform all the computations in the required time frame. In 
most cases it is impossible to provide a processor array related to the 
data array by an integer multiple. 
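The idle-time argument above can be illustrated with a small calculation. The following is a minimal sketch, assuming (this is not stated in the text) that an M x M data array is processed on an N x N processor array in whole N x N tiles, so that a partially filled tile still occupies every processor for one pass. The function name and the tiling model are hypothetical.

```python
import math

def array_utilization(n_proc: int, m_data: int) -> float:
    """Fraction of processor passes doing useful work when an M x M data
    array is processed on an N x N processor array in N x N tiles.
    The tiling model is an illustrative assumption."""
    tiles_per_side = math.ceil(m_data / n_proc)      # tiles along one side
    total_slots = (tiles_per_side * n_proc) ** 2     # processor slots consumed
    return (m_data ** 2) / total_slots

if __name__ == "__main__":
    # An 8 x 8 processor array against several data array sizes.
    for m in (4, 8, 9, 12, 16):
        print(m, round(array_utilization(8, m), 3))
```

Under this assumption the utilization is 1.0 only when the data dimension is an integer multiple of the processor array dimension (8 and 16 in the loop above), and it falls sharply just past a multiple (9), which is the situation the paragraph describes.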
It is the concern of this research to obtain an optimal architecture 
capable of computing vector, matrix operations. The architecture will be 
optimized to cost and constrained to time, power, and circuit size. A 
flow chart of the problem is shown in Figure 1. The approach will employ 
the use of two or more processors that operate at different cycle times. 
It is evident that processing speed is directly related to power and cost, 
with an inverse relationship to circuit size. A design founded on these 
relationships lends itself to linear integer programming optimization 
techniques. Optimization based on cost of the array elements will result 
in an efficiently designed computer capable of adhering to time, power, 
and circuit requirements. The approach flow chart is shown in Figure 2. 
[Figure 1 flow chart: START -> UTILIZE: DISTRIBUTED ARCHITECTURE ->
DESIGN FOR VECTOR, MATRIX PRODUCT TYPE PROBLEMS -> OPTIMIZE EACH STAGE OF
TO COST, CONSTRAINED BY TIME, POWER, AND SIZE -> APPLY TO LINEAR RECURSIVE
FILTERING PROBLEMS]
Figure 1. Problem Flow Chart
[Figure 2 flow chart: START -> CLASS OF PROBLEMS: VECTOR, MATRIX PRODUCTS ->
STUDY ARRAY PROCESSING DISTRIBUTED ARCHITECTURE -> DEFINE DESIGN CRITERIA
USING DEFINITIONS OF ARRAY PROCESSING SYSTEMS -> OPTIMIZE DESIGN TO COST,
PARAMETERS CONSTRAINED: TIME, POWER, AND SIZE -> DESIGN PROCESSORS OF THE
ARRAY TO MEET THE DESIGN CRITERIA OF THE SYSTEM -> APPLY STUDY TO KALMAN
FILTER TYPE PROBLEMS]
Figure 2. Approach Flow Chart
The results of this study will produce an algorithm to follow in the
design of computers to perform vector, matrix computations. This
procedure will yield an optimal structure capable of accomplishing the desired
computations in a specified time period, optimized to cost and constrained 
to power, time, and circuit area. A second product of the work will gen-
erate the interconnecting circuits necessary to interleave the computed 
data to the proper accumulators during the calculation process. In the 
course of this research, two other conclusions will be reached. In the 
area of floating point hardware, a maximum throughput structure will be 
illustrated. Further, a design of the most efficient bus structure for
transferring the required data will be analyzed. This work applies system
theory techniques to the heretofore "black magic" solution to the
design of digital systems. 
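The cost optimization outlined in this chapter can be sketched in a few lines. This is a minimal illustration, not the mixed integer linear programming procedure of Appendix A: it replaces the linear program with an exhaustive search over integer processor counts, and every number in it (cycle times, power, area, cost) is a hypothetical placeholder chosen only so that speed rises with power and cost and falls with circuit size, as stated above.

```python
from itertools import product

# Hypothetical processor types: (cycle_time_us, power_w, area_cm2, cost).
# The faster type draws more power, costs more, and occupies less area.
FAST = (1.0, 2.0, 1.0, 10.0)
SLOW = (4.0, 0.25, 2.0, 3.0)

def cheapest_mix(work_ops, deadline_us, max_power_w, max_area_cm2, max_count=64):
    """Return (cost, n_fast, n_slow) for the cheapest feasible mix, or None.

    Exhaustive enumeration stands in for integer programming here."""
    best = None
    for n_fast, n_slow in product(range(max_count + 1), repeat=2):
        # Operations each type completes within the deadline.
        ops = (n_fast * int(deadline_us // FAST[0])
               + n_slow * int(deadline_us // SLOW[0]))
        power = n_fast * FAST[1] + n_slow * SLOW[1]
        area = n_fast * FAST[2] + n_slow * SLOW[2]
        cost = n_fast * FAST[3] + n_slow * SLOW[3]
        feasible = (ops >= work_ops and power <= max_power_w
                    and area <= max_area_cm2)
        if feasible and (best is None or cost < best[0]):
            best = (cost, n_fast, n_slow)
    return best

if __name__ == "__main__":
    # 64 operations in 8 microseconds, at most 12 W and 40 cm^2.
    print(cheapest_mix(64, 8.0, 12.0, 40.0))
```

With these placeholder values neither processor type alone satisfies all three constraints, and the search settles on a mixture of the two speeds, which is the motivation for using processors with different cycle times.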
CHAPTER II 
SURVEY OF DISTRIBUTED PROCESSOR SYSTEMS 
Introduction 
A study of available literature on special computer systems produces 
a number of distributed processor-based designs. Several pertinent com-
puter structures are discussed in the following sections. 
The Unger Computer 
One of the earliest examples of a distributed architecture computer 
was proposed by Unger (1). This system used a stored program to handle 
specific problems by directly processing information in planar form with-
out format conversion or scanning operations. The structure of the de-
vice lends itself to handling pattern detection problems. 
The structure of the Unger system is illustrated in Figure 3. This 
computer uses a master control unit and a rectangular array of processor 
elements. Each processor element can communicate with the four adjacent 
elements and receive commands from the master control. The controller 
is composed of a random access memory to store instructions, decoding 
circuits, and a clock. Commands from the controller are generated in 
parallel to all processing elements, but individual processing elements 
are not addressable. Programming is accomplished with 14 assembly lan-
guage instructions that are executed by the controller. 
[Figure: a master control unit connected to the modules of a rectangular
array of processing elements; PE = processing element]
Figure 3. The Unger Computer System
Each processing element consists of a single bit accumulator, some 
associated logic, and a small random access memory. Inputs to each
processing element consist of control lines from the system controller and
links to the accumulators of the adjacent elements. 
A branch on accumulators equal zero is accomplished in the control 
unit by using a logical adder to evaluate the inputs to the controller 
from the accumulator of each processing element. This instruction func-
tions in the same way as the conditional transfer used in conventional 
computers by causing the control unit to skip the next instruction when 
zeros are detected in the accumulators. This transfer instruction com-
poses the only decision dependent command utilized by the machine. 
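The control scheme described above can be sketched as follows. This is an illustrative model only: the instruction names SPREAD and SKIP_IF_ZERO are hypothetical (the 14 instructions of the Unger machine are not listed in this survey), and the only behavior taken from the text is that every element receives the same command, each element sees its four adjacent accumulators, and the controller skips the next instruction when all accumulators are zero.

```python
def spread(acc):
    """Every PE ORs its 1-bit accumulator with its four adjacent accumulators."""
    rows, cols = len(acc), len(acc[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            bit = acc[r][c]
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    bit |= acc[rr][cc]
            out[r][c] = bit
    return out

def all_zero(acc):
    """Controller-side test: logical sum (OR) over every accumulator."""
    return not any(bit for row in acc for bit in row)

def run(program, acc):
    """Broadcast each instruction to the whole array (hypothetical opcodes)."""
    pc = 0
    while pc < len(program):
        op = program[pc]
        if op == "SPREAD":
            acc = spread(acc)
            pc += 1
        elif op == "SKIP_IF_ZERO":
            # Skip the next instruction when all accumulators hold zero.
            pc += 2 if all_zero(acc) else 1
        else:
            raise ValueError(f"unknown instruction: {op}")
    return acc

if __name__ == "__main__":
    grid = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
    print(run(["SKIP_IF_ZERO", "SPREAD"], grid))
```

Note that no element is addressed individually; the data pattern alone determines which elements change, and the single global zero test is the only point at which the program flow depends on the data.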
The Holland Machine 
A computer system organization has been described by Holland (2) 
that places control at the processor element level in the system. This 
system organization is in direct contrast to the central control concept 
proposed by Unger. 
The concept of the Holland machine provides a basis for investigation 
of the theory of automata and computability. Holland's system consists of 
a two-dimensional array of identical processing elements, with each ele-
ment containing a storage register, routing logic, and auxiliary regis-
ters. During any given machine cycle, a processor element is either 
active or inactive. If the element is active, it decodes the contents of 
its storage register as an instruction and proceeds to execute the opera-
tion. Following the execution of the instruction, the processor element 
passes its active status to the next element, which may be any adjacent 
processor element in the array. Using this concept, sequences of 
instructions are scattered throughout the array of processor elements, 
with an arbitrary number of instructions being executed at any given time. 
There are three phases that compose the operating cycle of this sys-
tem. During the first, processor element storage registers may be set to 
values introduced by external sources. In the next phase, all active 
elements determine the address of their operands by logically enabling 
data paths. The last phase consists of the execution of the instruction 
in the storage register. 
The disadvantage of this machine appears to be the difficulty of 
programming in an efficient manner that will allow a large number of 
processor elements to be active during a machine cycle. Also, it is 
necessary to use a massive amount of hardware to solve a reasonable com-
putation problem. 
The important contribution of this work is the development of an 
array of locally controlled identical processing elements. The hinder-
ances of this approach are the extensive amount of hardware utilized and 
the programming difficulty. 
The Comfort Machine 
The array-structured system based on the concept of local control 
was further studied by Comfort (3). This study culminated in a modified 
Holland machine with a fixed-size rectangular array of processing ele-
ments. Processor elements are composed of two relatively independent sec-
tions: the control section and its memory, and the communication section. 
The arithmetic units are placed beside the array of processor elements in 
this configuration and perform all mathematical and logic operations. 
This computer contains no central control unit; therefore, each processor 
element executes its own program once it is enabled. The execution of a 
set of instructions causes the enabling and disabling of successive pro-
cessor elements. 
Comfort's computer was designed to provide some improvements to the 
Holland computer, which are listed below: 
1. System is easier to program by several orders of magnitude. 
2. Machine size is reduced by a factor of five. 
3. Utilization of hardware is improved by a factor of three. 
The drawback to Comfort's system is that only one program sequence 
per arithmetic unit can be operated concurrently. 
The SOLOMON Machine 
The SOLOMON (Simultaneous Operation Linked Ordinal Modular Network) 
system is a distributed architecture computer introduced by Slotnick, 
Borck, and McReynolds in 1962 and later revised in 1966 (4). The archi-
tecture was conceived to satisfy a particular class of problems and adhere 
to current needs in computing capabilities. The primary purpose of this 
device was to implement matrix operations and computations. This class of 
problems consists of linear systems analysis, matrix calculations, and 
solutions to systems of ordinary and partial differential equations. 
Figure 4 illustrates the construction of the machine and its three 
major units. The network control unit (NCU) is first and provides the 
central control of the machine. The NCU is composed of at least one arith-
metic and control unit, and is expandable to a multiple unit configuration. 
An array of processing elements (PE) in a 32 x 32 structure makes up the 
second major unit. The processor configuration is designed to allow
modules of 256 PE's with the associated memories for each to be added or
removed from the unit without design alterations. The third major unit
is the input-output unit (IOU) which is comprised of five modules of 32
data channels, with each channel acting as a separate input-output link.
[Figure: central control with program storage and branching levels,
connected to an array of processing elements, each linked to its adjacent
elements; PE = processing element]
Figure 4. The SOLOMON Computer System
The processing elements of the computer are identical and each
possesses complete arithmetic capacity. An individual processing element
has associated with it two memories with 4096 bits of storage (expandable 
to 16,384 bits) in each memory. A processing element can perform serial 
logic and arithmetic operations and can communicate serial data to the 
four adjacent elements in the array. The elemental conclusions of this 
study reveal the adaptability of a machine that utilizes identical pro-
cessing cells under a central control and the capability of serial com-
munication with the four nearest adjacent elements. 
The Gonzalez Iterative Computer 
A multilayer iterative circuit computer (ICC) has been proposed by 
Gonzalez (5) and is an improvement on the work done by Unger and Holland.
The architecture of this computer provides the capability to solve prob-
lems involving spatial relationships between variables. The processing 
elements in this architecture are placed in three stacked layers. The 
layers are identical and consist of a program layer, a control layer, and 
a computing layer. The data and instructions are stored in the program 
layer; the control layer performs the decoding of instructions that are 
executed by the computing layer. The programming sequence is similar to 
the Holland computer in that each instruction specifies the processing 
element that contains the next instruction. Pipelining allows the con-
trol and programming layers to work on the next two instructions during 
the execution phase. 
The elements in the three planes of M x N modules are identical and 
are allowed to communicate by means of control lines. The internal archi-
tecture of the elements is composed of an accumulator, a register, a de-
coder, and a number of switching matrices used to interface the decoder 
to the data and control lines. As with other computers of this type, the 
programming is difficult and the hardware is inefficient. The major
features of the Gonzalez design that are of further interest are
1. The path connecting method that retains the time access features 
of a common bus computer, while allowing simultaneous operation of other
paths in the system. 
2. The complete separation of control signals from the data flow. 
3. Three-phase operation, with each phase active simultaneously on 
each layer, but executing different instructions. 
The ILLIAC IV Computer 
A distributed architecture system called the ILLIAC IV was
introduced by Barnes, Brown, Kato, Kuck, Stokes, and Slotnick (6). The
computer is a continuation of the work done on the SOLOMON computer and is
employed to implement matrix, vector computations. 
The design of the ILLIAC IV system is illustrated in Figure 5. The 
processing elements (PE) are placed in four arrays of 64 elements each 
with one control unit for each array. The function of the control unit 
is to decode instructions and control the 64 elements in the array which it
is designed to manipulate. The operations of the four arrays can be
combined to perform multiprocessing or single processing operations, all
under control of one program. The system program is stored in a
general-purpose computer, a Burroughs B6500, that is responsible for
loading the arrays, controlling the configuration of the system, and
outputting the data. Backup memory for the array is provided by disk in a
parallel access configuration that is directly attached to the ILLIAC IV
system.
[Figure: parallel access disk, I/O switch, real time link, and
general-purpose computer connected to the processing element arrays]
Figure 5. The ILLIAC IV Computer System
The architecture of the ILLIAC IV consists of four SOLOMON arrays 
of 64 processors each to provide 256 processor elements. A processor 
element in the system can perform 240 nanosecond addition and 400
nanosecond multiply operations on a 64-bit operand. The processors are each
constructed of 10^4 ECL gates and a memory with 240 nanosecond delay and
a 2K word configuration.
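The figures quoted above imply an upper bound on arithmetic throughput. The short calculation below is a sketch of that bound, assuming (optimistically, and not as a claim from the source) that all 256 processor elements are busy on every operation and that control and memory overheads are ignored. The function name is hypothetical.

```python
def peak_ops_per_second(pe_count: int, op_time_ns: float) -> float:
    """Ideal throughput if every PE completes one operation per op_time_ns."""
    return pe_count * 1e9 / op_time_ns

PE_COUNT = 4 * 64   # four arrays of 64 processor elements
ADD_NS = 240        # addition time quoted in the text
MUL_NS = 400        # multiply time quoted in the text

if __name__ == "__main__":
    print(f"peak additions/s:       {peak_ops_per_second(PE_COUNT, ADD_NS):.3e}")
    print(f"peak multiplications/s: {peak_ops_per_second(PE_COUNT, MUL_NS):.3e}")
```

Under these idealized assumptions the bound is roughly 1.07 x 10^9 additions or 6.4 x 10^8 multiplications per second; any idle processor elements reduce it proportionally.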
The like units of the ILLIAC IV are constructed to be interchange-
able, as are the power supply parts. Trouble-shooting is thus made eas-
ier and down-time is reduced. 
The MCB Machine 
The Modular Computer Breadboard (MCB) was introduced by the NASA 
Electronics Research Center (7). The architecture is in the form of a 
modular system that can be reconfigured to operate as a distributed net-
work of processors. The system can be laid out in the form of columns 
of processor elements that may be unplugged if not required in the com-
putation. Each individual processor may be operated independently of the 
others or in conjunction with others, if required. 
Four modules compose the structure of the system: a memory, control 
unit, arithmetic unit, and input-output unit. Alterations to the
configuration are accomplished by the configuration control unit and the
configuration control switches. Of prime importance to the configuration
capabilities is the use of triple redundancy to provide extra reliable
operation. The modules of this computer are constructed of LSI circuits
and emphasis is placed on plug-in type units. 
Berkeley Array Processor 
The Berkeley Array Processor was introduced in 1970 by Dere and Sak-
rison (8) to be used as a general-purpose system. The processor will 
efficiently perform the operations of convolution, correlation, recursive 
filtering, matrix multiplication, and fast Fourier transform. 
Dere and Sakrison's system functions as an input-output device,
operating in conjunction with an IBM 1800 computer. In this design, the
program and data are stored in the IBM 1800 and accessed under control of the
array processor. Arrays of data stored in the IBM 1800 are transferred to 
the 104-word shift register memory of the array processor via a pseudo two 
channel data link. The structure of the array processor consists of shift 
register memory, an array index unit, arithmetic section, accumulator, and 
necessary functional control logic. The clock period is 140 nanoseconds, 
and the data paths and registers are constructed to accommodate 16-bit 
words. 
Instructions for the processor are composed of 17 operations capable 
of handling complex data processing. The majority of the instructions 
are utilized to control the arrays of data and to perform bookkeeping 
operations on input and output information. The input-output operation 
of this processor introduces the concept of storing data in a larger com-
puter and processing the data in a special central processing unit de-
signed specially for the purpose of array structured data. This system 
uses only standard design techniques in its construction and provides 
some significant logic innovations. 
The Cannon Computer 
Efficient performance of matrix operation was the main design con-
straint of the array processor proposed by Cannon (9). The system was 
intended for use in implementing algorithms utilized in linear recursive 
filters and to extend the designs of the SOLOMON and ILLIAC IV computers. 
The structure of the system, shown in Figure 6, consists of a two-
dimensional square array of processor units and a global control unit. 
Control of processing elements is maintained through the use of parallel 
control lines to each of the identical processor elements. Besides the 
control logic, the global controller has arithmetic hardware and a large 
data storage capability. This design allows the control unit to not only
decode and control execution of instructions, but further allows for data 
processing to take place in the controller. 
The processing elements are identical in structure and consist of 
sixteen 32-bit words of memory, a floating point adder-subtracter, float-
ing point multiplier, and logic to control the data flow. The design 
calls for all processing elements to receive identical control signals 
and perform the same operations during each cycle of computation. The 
matrix data is stored in the memory of the processor element array in the 
same (i, j) element location that the data holds in the data array. Pro-
cessor elements in the array are interconnected by singular horizontal, 
vertical, and diagonal data lines that allow special data transfer oper-
ations to be implemented. These special operations are broadcasting, 
rotating, skewing, and transposing the data matrix. 
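To show how skewing and rotating can serve a matrix product on such an array, the following sketch implements the skew-and-rotate multiplication scheme commonly associated with Cannon. It is offered as an assumption about how these transfer operations combine, since the survey does not spell out the algorithm. Each list position (i, j) stands for one processor element holding one element of each matrix, and all elements perform the same multiply-accumulate in every step.

```python
def cannon_multiply(a, b):
    """Multiply two n x n matrices with skew and rotate data movements only.

    Position (i, j) models the processor element that stores a[i][j],
    b[i][j], and the accumulating result c[i][j]."""
    n = len(a)
    # Skew: row i of A rotates left by i, column j of B rotates up by j.
    a = [[a[i][(j + i) % n] for j in range(n)] for i in range(n)]
    b = [[b[(i + j) % n][j] for j in range(n)] for i in range(n)]
    c = [[0] * n for _ in range(n)]
    for _ in range(n):
        # Identical multiply-accumulate in every processor element.
        for i in range(n):
            for j in range(n):
                c[i][j] += a[i][j] * b[i][j]
        # Rotate: A one position left along rows, B one position up along columns.
        a = [[a[i][(j + 1) % n] for j in range(n)] for i in range(n)]
        b = [[b[(i + 1) % n][j] for j in range(n)] for i in range(n)]
    return c

if __name__ == "__main__":
    print(cannon_multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

The product of two N x N matrices completes in N multiply-accumulate steps, with each element communicating only along its row and column lines, which is why the number of processor elements must match the N x N data array.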
The structure proposed in this design yields a significant reduction 
in processing time over conventional machines due to its parallel
processing capability. A disadvantage to this scheme is the need of N^2
computing elements to compute data in an N x N dimension array. The system
can handle efficiently an array that is of dimension N x N, but for larger
or smaller arrays, it appears to be less effective.
[Figure: global control unit with column registers and row registers
surrounding a square array of processor elements]
Figure 6. The Cannon Computer System
The General Electric Matrix Processor 
This system is an implementation of the Cannon computer designed and 
built by Moyer, Rice, and Fifolt in 1977. The controller used was a Z80 
microprocessor and the processor elements consist of hardware floating
point units arranged in an 8 x 8 array. This architecture is shown in
Figure 7. 
Real time linear recursive problems are the prime purpose of the 
machine, and in particular the Kalman filter algorithm is easily handled 
by the system. In its full operational state this unit requires one 
processor element for each element in the data array. Increasing the 
capability of the system requires the addition of more processor elements. 
The techniques utilized in this system are of interest since the im-
plementation was realized using LSI circuits. There are 400 integrated 
circuits making up the system, and it operates at a 60 watt power level. 
The system is capable of computing all the matrix functions commonly 
found in real time control and signal processing applications. Complex 
matrix algorithms may be computed using the microprogram capability of 
the system which allows it to function in time limits that cannot be met 
by conventional computers. 
[Figure: block labels include row memory, column, arithmetic array, vector
gains, column output, row output, Z80 microcomputer, and memory sequencer]
Figure 7. The General Electric Processor
Summary
The distributed architecture design techniques proposed over the
last 20 years have been presented, along with the advantages and
disadvantages of these systems. A growing need for small special-purpose
computers has also been discussed and various problems such as filtering and
navigation have been pointed out to have array structures. It has been 
deemed advantageous to base the design of the hardware upon characteris-
tics of the problem to be solved in situations where a group of problems 
have mathematical structural similarities. Research in this area is 
being carried out in both industry and university environments with both
entities placing strict concern on advances in LSI technology. The ulti-
mate culmination of these studies will be algorithms capable of designing 
fast, low cost, and efficient systems able to handle the myriad of prob-
lems that face the engineering community. 
CHAPTER III
DEFINITIONS OF DISTRIBUTED ARCHITECTURE 
AND PROCESSING 
Introduction 
Distributed or parallel processing is defined as processing two or 
more portions of an algorithm by two or more processing units during the 
same interval of time (10). This processing takes place at the task, 
subtask, instruction stream, or data set level. The result of this defi-
nition generates a multiple processor system organization in the hardware 
(11). Once the hardware is defined as a multiple processor system, it is 
further divided into two processing structures. These structures are 
termed single instruction multiple data (SIMD), shown in Figure 8, and 
multiple instruction multiple data (MIMD) (12), shown in Figure 9.
A definition based on the organization of the hardware processors
alone does not define the entire concept of distributed architecture (13). 
The definition must consider the concepts under which the distributed pro-
cessors are allowed to communicate and share memory space. Further, the 
programs to be executed by such a distributed system should make maximum 
use of the parallel capabilities of the computer. 
SIMD Architecture 
This architecture is sometimes referred to as a parallel processing
system (14) that uses one control unit to fetch and decode the program
instructions. The instructions may be executed in the central control
unit or they may be sent directly to subordinate processors in the system.
Figure 8. SIMD Architecture
Figure 9. MIMD Architecture
The SIMD systems are further divided into three subclasses, called
array processors, processing ensembles, and associative processors (10).
The array processor is designed to allow the instructions
to operate on vectors of data at the same time with the control
unit having only limited capabilities. To implement the processing en-
semble, the control unit must be a complete computer and the processing 
units are only allowed to communicate by passing data through the control 
unit. The last subclass is that of associative processors, which are 
designed to allow subelement processors to access and operate on data 
only by its content and not by storage locations (10, 11, 12). 
Array processors are considered to be the most applicable subclass 
of SIMD architecture from cost and throughput criteria (10). The need
for maximum throughput motivates the design of array processors since 
their tremendous throughput is accomplished by simultaneous operation of 
individual processors on different streams of data. Three criteria must 
be met to allow an array processor its maximum parallel capabilities (10). 
First, all calculations should be described by vector instructions which 
cause large amounts of data to be manipulated simultaneously by one oper-
ation. Secondly, there must be high speed data paths between processor 
elements; and last of all the block of data that is to be processed during 
one time interval must also be fetched from memory in one time interval. 
Failure to meet these criteria will result in the system tending toward 
serial operation and a major reduction in throughput (10, 12, 13, 14). 
MIMD Architecture 
The vector operations of SIMD are not used in an MIMD system, where 
parallelism is obtained by performing different operations on individual 
data sets in a given time interval (10). The results of the separate 
operations are combined to form an end result to the computations. The 
primary criterion for efficiency of an MIMD system is the design of proper
synchronization of the individual subsystems and allocation of the pro-
cessing in an effort to balance the computational load on the system. 
The SIMD system operation does not face the same synchronization problem 
since each individual processor of like type is doing the same operation 
concurrently. The MIMD architecture may be divided into two subclasses,
which are called the multiprocessor system and distributed system (10, 
12, 13). The multiprocessor system uses one controller as a master to 
allocate the slave processors individual tasks as requests for these 
tasks are required. The slave processors may or may not be capable of 
all doing the same functions to the incoming data stream (12, 13, 14). 
This multiprocessor system may be used in applications of general 
purpose computations, with each slave processor performing general tasks. 
As new task requirements are generated, an idle processor is used to 
perform the operation or to pick up another processor's work if a failure
occurs in some other subordinate unit (10). The system may be altered
to allow different processors to do separate tasks, but as new programs 
of the same type are generated, they must wait for a specific processor 
to become available. In certain environments only a few of the slave
processors would be used while the remainder go idle (10, 13, 14). 
MIMD architecture has a second subclass called a distributed system 
(12) which is composed of multiple processors designed to do specific 
functions in a partitioned system. With this architecture the subordinate 
processor systems may be located in one area, or separated by large distances
and connected by communications networks. The algorithms to be
executed on such a system must be known prior to operation of the system 
so that software may be segmented into dedicated programs to drive the 
individual processors. Such a diverse system, however, reduces interaction
between subsystems and makes debugging of individual systems less
complicated (10, 12). 
Communication in distributed systems consists of messages sent be-
tween processors or blocks of data being transferred from one unit to 
another by way of shared peripherals or serial communication channels. 
A further specification of this subclass of architecture may be drawn 
from the requirement that the memory of the system not be shared by any 
of the sub-units (12). From these definitions it is evident that the 
disadvantages of such an architecture are (a) load characteristics are 
difficult to determine for general program usage, (b) poor parallel usage 
of subprocessors, and (c) general difficulties in controlling the system 
if expansion of the system is desired (10, 13, 14, 15). 
Coupling 
The MIMD subclass of multiprocessor systems has been further classified
by the amount of memory shared between its subprocessors. Systems
are referred to as being tightly coupled or loosely coupled. A tightly 
coupled system is designated as one that is subjected to a strict control 
scheme implemented in self-contained hardware (16). Tightly coupled 
systems have been defined by Bowra (11) to include processors which trans-
fer data through shared memory. However, this definition is not followed 
by Weissberger (10), and Enslow (13) in their concepts of multiprocessor 
systems. 
Loosely coupled systems are considered to be systems that have no 
interaction between processor programs but do allow memory to be shared 
(10, 13). Systems of loosely coupled processors require adequate commu-
nication and memory sharing to reduce the ramifications of subsystem 
failures. These results are obtained by allowing dynamic reconfiguration
of the operating system in the event of subsystem failure. Communication 
capabilities will also reduce the interaction of processors during simul-
taneous attempts to acquire data in shared locations. Such control is 
accomplished by allowing priority criteria to be established (10, 11,
13, 14). 
Definition of Time and Space Complexity 
Once an algorithm is in a form ready to be executed by a computer, 
two questions must be answered concerning its complexity (17): 
1. What is the extent of memory space necessary to execute the 
algorithm? 
2. What is the time required for execution of the algorithm? 
To attempt to answer these two questions it is desirable to utilize 
some algorithm evaluation criteria. The object is to determine the de-
pendence of the time or space required to solve the problem as it grows 
in order and observation. It is necessary to associate with the algor-
ithm an integer, called the size of the problem, assumed to be a measure 
of the input data and order of the system. 
The time required for solution of an algorithm may be expressed as 
a function of size of the problem, and called the time complexity (17). 
The limiting behavior of the complexity as size increases is called asymp-
totic time complexity (18). An analogous definition can be made for space 
complexity and asymptotic space complexity (18). 
The asymptotic complexity of an algorithm is said to determine the 
size of the system which can be solved by the algorithm. An algorithm 
may process incoming data of size N in time $TN^2$ for some constant T.
From this it is seen that time complexity of the algorithm is order $N^2$.
A better definition (18) may be generated by stating that a function g(N)
is of order f(N) if there exists a constant C such that $g(N) \le C f(N)$ for
all but some finite set of non-negative values for N.
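The order definition above can be illustrated numerically. The following is a minimal sketch, not taken from the thesis: the running time g(N) = 3N^2 + 5N, the constant C = 4, and the helper name `exceptions_to_bound` are all hypothetical choices made for illustration. It lists the values of N at which the inequality g(N) <= C f(N) fails; a finite list is what the definition permits. A check over a finite range is only an illustration, not a proof.

```python
def exceptions_to_bound(g, f, c, n_max):
    """Return every N in [0, n_max] for which g(N) <= c * f(N) fails."""
    return [n for n in range(n_max + 1) if g(n) > c * f(n)]

# Hypothetical running time and candidate order (assumptions for illustration).
def g(n):
    return 3 * n * n + 5 * n

def f(n):
    return n * n

# The inequality 3N^2 + 5N <= 4N^2 reduces to 5N <= N^2, which fails only for N = 1..4.
bad = exceptions_to_bound(g, f, c=4, n_max=1000)
print(bad)
```

Because the exceptions form a finite set, g(N) is of order N^2 under the definition quoted from (18).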
It might be expected that increases in throughput of data brought about
with each new generation of digital computers would decrease the concern 
over efficient algorithms or more efficient architecture. However, as 
computers increase in throughput, and bigger problems can be solved, it 
is still the complexity of the algorithm that limits the increase in 
problem size that may be handled by a faster computer. 
To further illustrate this time complexity definition, consider the 
case of five algorithms to solve the same problem, where each has a dif-
ferent time complexity (19): 
Algorithm    Time Complexity
A1           N
A2           N log N
A3           N^2
A4           N^3
A5           2^N
The definition of time complexity used here is the number of time
units required to process an input data set of size N. For example, if one
unit of time equates to one millisecond, A1 can process an input of size
N = 1000 in one second, whereas A5 can process an input of size only
N = 9 in the same second. Calculating the size of N relative
to one second, one minute, and one hour will give the values shown
in Table I (18).
TABLE I
LIMITS ON THE SIZE OF ALGORITHMS

             Time          One       One         One
Algorithm    Complexity    Second    Minute      Hour
A1           N             1000      6 x 10^4    3.6 x 10^6
A2           N log N       140       4893        2.0 x 10^5
A3           N^2           31        244         1897
A4           N^3           10        39          153
A5           2^N           9         15          21
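The entries of Table I can be reproduced with a short search. The sketch below assumes one time unit equals one millisecond, as stated in the text, and assumes a base-2 logarithm for A2; the names `max_size`, `COSTS`, and `TIME_UNITS` are illustrative and not from the thesis. The N log N entries depend on the logarithm base and on rounding, so small differences from the tabulated A2 values are possible.

```python
import math

# Time budgets expressed in units of one millisecond (assumption from the text).
TIME_UNITS = {"second": 1_000, "minute": 60_000, "hour": 3_600_000}

# Cost in time units for an input of size n; log base 2 is an assumption for A2.
COSTS = {
    "A1": lambda n: n,
    "A2": lambda n: n * math.log2(n),
    "A3": lambda n: n ** 2,
    "A4": lambda n: n ** 3,
    "A5": lambda n: 2 ** n,
}

def max_size(cost, budget):
    """Largest integer n >= 1 with cost(n) <= budget, for a non-decreasing cost."""
    lo, hi = 0, 1
    while cost(hi) <= budget:      # grow the bracket until the budget is exceeded
        lo, hi = hi, hi * 2
    while hi - lo > 1:             # binary search inside the bracket
        mid = (lo + hi) // 2
        if cost(mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

limits = {
    name: {label: max_size(cost, budget) for label, budget in TIME_UNITS.items()}
    for name, cost in COSTS.items()
}
for name, row in limits.items():
    print(name, row)
```

The polynomial and exponential rows are computed with exact integer arithmetic, so only the A2 row is sensitive to floating point rounding.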
As computer components become faster, and if a ten-fold increase in 
calculation speed is assumed, it is possible to calculate another set of 
results for Table I. Table II (18) gives the size of a problem which may 
be processed as a result of a ten-fold increase in data processing speed. 
TABLE II
EFFECTS OF TEN-FOLD SPEED-UP

             Time          Problem Order       Problem Order
Algorithm    Complexity    Before Speed-Up     After Speed-Up
A1           N             I1                  10 I1
A2           N log N       I2                  Approx. 10 I2
A3           N^2           I3                  3.16 I3
A4           N^3           I4                  2.15 I4
A5           2^N           I5                  I5 + 3.3
The results of Table II illustrate, for example, that algorithm A3 
may handle data 3.16 times larger with the ten-fold increase in computer 
speed, but A5 is increased by only 3.3 added to its previous size. 
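The factors in Table II follow directly from the complexity functions: for a cost of N^k a ten-fold speed-up multiplies the solvable size by 10^(1/k), while for a cost of 2^N it only adds log2(10) to the size. A minimal sketch of that arithmetic follows; the variable names are illustrative only.

```python
import math

SPEEDUP = 10  # ten-fold increase in processing speed, as assumed in the text

# Multiplicative growth in solvable problem size for polynomial costs N^k.
growth = {
    "A1 (N)": SPEEDUP ** (1 / 1),
    "A3 (N^2)": SPEEDUP ** (1 / 2),
    "A4 (N^3)": SPEEDUP ** (1 / 3),
}

# For the exponential cost 2^N the gain is additive: 2^(N + d) = 10 * 2^N gives d = log2(10).
additive_gain_a5 = math.log2(SPEEDUP)

for name, factor in growth.items():
    print(f"{name}: size multiplied by {factor:.2f}")
print(f"A5 (2^N): size increased by {additive_gain_a5:.1f}")
```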
It is now thought that the rate of increase in computational through-
put due to technology improvements is declining, which suggests that fur-
ther increases in throughput will only result from better algorithms or 
systems architecture. Assuming that an algorithm is fixed, the most
effective method of increasing the throughput is to use distributed
architecture, or in other words, to process the data in parallel whenever
possible. Most conventional computers operate in a strict sequence with only
one operation taking place at a given time, called serial processing.
The distributed architecture approach replaces a computation requiring N
steps by m independent subcomputations occurring simultaneously. Not all
algorithms adapt well to parallel processing, so a general statement of the
results of a fixed parallel architecture as related to an open class of
algorithms is not thought possible (18).
Summary 
The intent of this chapter has been to introduce the distributed 
architecture configurations and tabulate some of their advantages and 
deficiencies. The background literature on the dilemma of vector, matrix 
structured computation has always been directed towards the use of single 
instruction, single data path architecture. It is pointed out in this 
chapter that the idiosyncrasies of array processing are far more appli-
cable to this research than the other possible configurations. 
Having once analyzed the architecture definitions, the concept of 
time complexity is viewed. This time complexity definition will serve 
as a primary constraint to the optimization problem and provide a basis 
of design for real time computations. The steps in computing a vector, 
matrix product will provide a fixed algorithm to be used in solving the 
problem. By fixing the algorithm, the architecture will be allowed to 
vary in order to meet the time complexity and other constraint criteria. 
CHAPTER IV 
OPTIMIZATION OF DISTRIBUTED ARCHITECTURE 
Introduction 
In the interim between the initial research and conclusion of the literature
survey, array processing formulas were employed to realize vector,
matrix multiplications and similar mathematical operations. In each instance,
the definition of array processing was closely followed and led
to a computing structure composed of a number of elements exactly equal to
the dimension of the problem array.
An example of a matrix, vector multiplication problem is: 
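\[
\begin{bmatrix}
x_{11} & x_{12} & x_{13} & x_{14} & x_{15} \\
x_{21} & x_{22} & x_{23} & x_{24} & x_{25} \\
x_{31} & x_{32} & x_{33} & x_{34} & x_{35}
\end{bmatrix}
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \end{bmatrix}
\]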
An array processor element configuration is: 
PE(11)   PE(12)   . . .   PE(16)
PE(21)            . . .   PE(26)
PE(31)            . . .   PE(36)
---------[ CONTROL SYSTEM ]---------
One processor element (PE) for each term in the matrix
The solution of an M x N matrix, vector product requires M·N multiplications
and M(N-1) additions. Verification of this conclusion is provided
by Aho, Hopcroft, and Ullman (18) in a general theorem of matrix
multiplication algorithms. Their theorem illustrates that, in general,
there exist no procedures to diminish the aggregate multiplications and
additions fundamental to the solution of a vector, matrix product, such
as XA, where X is a matrix of order M x N and A is a vector of order N x 1.
M times N multiplications and M(N-1) additions represent closure of the
algorithm and approach the assumption of a fixed procedure of lowest
order. With this assumption made, attention is then concentrated on
determination of an architecture, variable in structure, optimal in cost,
and constrained by time, power, and size.
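The operation counts quoted above can be checked by instrumenting a straightforward product. The following is a minimal serial sketch, not the thesis hardware: the function name `matvec_with_counts` and the sample data are hypothetical. It forms XA row by row and counts each multiplication and addition.

```python
def matvec_with_counts(x, a):
    """Compute the product of an M x N matrix x and an N-vector a, counting operations."""
    n = len(a)
    mults = adds = 0
    result = []
    for row in x:
        acc = row[0] * a[0]          # first term of the row needs no addition
        mults += 1
        for j in range(1, n):
            acc += row[j] * a[j]     # one multiplication and one addition per further term
            mults += 1
            adds += 1
        result.append(acc)
    return result, mults, adds

# Sample 3 x 5 problem (illustrative values only).
x = [[1, 2, 3, 4, 5],
     [6, 7, 8, 9, 10],
     [11, 12, 13, 14, 15]]
a = [1, 0, 2, 0, 1]

y, mults, adds = matvec_with_counts(x, a)
print(y, mults, adds)
```

For M = 3 and N = 5 the counts are M·N = 15 multiplications and M(N-1) = 12 additions.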
The literature study centers around an array mechanism composed of 
M x N operational units, interconnected to allow each unit to compute one
partial product of the matrix, vector product. This design criterion
stipulates that as the dimension of the problem matrix increases or de-
creases, the dimension of the computing mechanism must alter to equate 
to the order of the problem. The variations in array design are appli-
cable to machines of an unlimited category; however, if real time design 
constraints are imposed on the structure, alterations must be generated 
to correct the design. 
Array Processing Limitations by Definition 
The definition of SIMD processing stipulates that individual proces-
sors must manipulate divergent data streams simultaneously. A conjectur-
al criterion predicated by the Array Processor subcategory necessitates 
the concurrent execution of the same sequence of instructions under one 
direct controller unit. Further definition implies that the data paths
interconnecting the processing elements be of a high speed parallel
nature, and that all data fetched in one time period be processed in one
time period and stored in one time period. A listing of the design parameters
for an array processor is as follows:
1. Processors should process different data streams concurrently. 
2. Processors should execute the same instruction concurrently 
under one control data stream.
3. All data paths in the system should be high speed.
4. Individual data blocks must be fetched in one interval T1, pro-
cessed in one interval T2, and stored in one interval T3. 
5. The number of operations occurring simultaneously in individual 
modules must be maximized. 
Design Limits of Array Processor 
If the criteria and structure of the array processor are utilized in
the presence of constraints such as time, power, cost, and circuit size, 
system performance is seriously degraded. The imperfections pertinent to 
the circuit design of array processors are best illustrated by attempting 
to design a processor array for a constrained problem. Time, power, and 
size restrictions must initially be designated and thereby provide limits 
to the choice of hardware capable of performing to the specifications. 
Array Processor Design Example 1 
Design Specifications:     Data Time--1 microsecond
                           System Power (Multipliers)--10 watts
                           Circuit Size--20 square units
Processor Chosen for
Job (Specifications):      Cycle Time--600 nsec
                           Power Per Unit--1 watt
                           Area (Multiplier)--1.5 square units
Array Design Data:         10 Multipliers--10 watts power
                           Cycle time is less than 1 µsec
                           System Area = (10 mply)(1.5 sq units/mply)
                                       = 15 sq units
*Maximum processor array size is 10 processing elements.
*Maximum number of matrix terms processed in one cycle is equal to 10.

X X X     Y      Nine multipliers can be used to process this
X X X     Y      vector, matrix product in less than 1 µsec.
X X X     Y

X X X X   Y      Ten multipliers will not process this vector,
X X X X   Y      matrix product in 1 µsec since two terms cannot
X X X X   Y      be processed in this time period.
          Y
Upon selection of necessary hardware, the maximum number of units that 
can be used to compose the machine and still meet the power and circuit
size constraints may be computed. At this point the capacity of the 
machine has been effectively restricted by the constraints and efficiency 
of operation can be determined. Assume the machine's largest acceptable 
array configuration is N x N. If a matrix to be processed by this machine 
is of dimension less than N, the solution is easily obtained, but some 
processing elements will be idle during the process and still require 
power. This degradation can be corrected by eliminating the idle units 
or disabling them until needed (17). 
As the dimension of the problem grows larger than the N x N array of
the machine, two cases develop. First, the computer may be redesigned to
use an array of processors of the fastest available cycle time, or just
fast enough to compute the terms of the matrix in the required time with
each processor computing several terms.
Array Design Example 2 
Specifications:            Data Time--1 µsec
                           System Power (Multipliers)--10 watts
                           Circuit Size--20 square units
Process Hardware
(Specifications):          Cycle Time--450 nsec
                           System Power--1.25 watts
                           Circuit Size--1.75 square units
Array Design Data:         8 Multipliers--10 watts
                           Cycle time less than 1/2 µsec
                           System Area = (8 mply)(1.75 sq units/mply)
                                       = 14 sq units

X X X X   Y      8 multipliers can process 16 matrix
X X X X   Y      terms in less than 1 µsec if each
X X X X   Y      multiplier does two terms.
X X X X   Y
This case deviates from the array processor design criteria, since a one-to-one
correspondence of processors to terms is nonexistent. However, as
constrained, this design variation is a possible solution to the dilemma.
If the array in this machine is composed of $N^2$ processing units, and the
matrix is composed of $X^2$ terms, then it is easily verified that if I =
X/N, where I is an integer, the resulting efficiency of the system will
be 100 percent, on the basis of power consumption to work. This relationship
is best seen by the following example: if N = 2, then the
machine array is composed of 4 processing units. A matrix of dimension
4 x 4 will be composed of 16 terms and the computation process is executed
in 4 cycles of the 4 processing elements with zero idle time. With X
and N not related by an integer I, there will exist idle states during
the problem execution time. For example, if N = 2 and X = 3, the process
will require 3 cycles of the 4 computing components, but
during the 3rd cycle only 1 processor will be used and 3 will be idle.
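The relationship between machine dimension and matrix dimension can be sketched as a small utilization calculation. This is an illustrative model only: it assumes every term costs one processor cycle and that terms are assigned to any free processor, and the function name `array_utilization` is hypothetical.

```python
def array_utilization(machine_n, matrix_x):
    """Cycles, idle processor slots, and efficiency for an N x N array on an X x X matrix."""
    processors = machine_n ** 2
    terms = matrix_x ** 2
    cycles = -(-terms // processors)          # ceiling division
    idle_slots = cycles * processors - terms  # processor-cycles in which no term is computed
    efficiency = terms / (cycles * processors)
    return cycles, idle_slots, efficiency

print(array_utilization(2, 4))   # 4 cycles, no idle slots
print(array_utilization(2, 3))   # 3 cycles, 3 idle slots in the last cycle
```

Under these assumptions the N = 2, X = 3 case runs at 75 percent efficiency, since 9 of the 12 available processor-cycles do useful work.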
The second case exists if slower hardware that is physically smaller 
and that requires less power is employed in the machine. The propagation
delay of these slower components must likewise satisfy the cycle time constraint,
but due to their size and power advantages, a larger array is realizable.
This scheme will also result in the capability of manipulating
a more comprehensive matrix, vector product. The boundary of the second 
case design exists at the point where the machine dimension equals the 
matrix dimension. 
Array Processor Design Example 3 
Design Specifications:     Data Time--1 microsecond
                           System Power (Multipliers)--10 watts
                           Circuit Size--20 square units
Process Specifications:    Cycle Time--900 nsec
                           Power Required--0.5 watts
                           Circuit Size--1.0 square units
Array Design Data:         20 Multipliers--10 watts
                           Cycle time less than 1 µsec
                           System Area = 20 square units

X X X X   Y      16 slow processors can be used
X X X X   Y      to do the process in less time
X X X X   Y      than 1 µsec.
X X X X   Y
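The three design examples share one calculation: the power and area budgets bound the number of multipliers, and the ratio of data time to cycle time bounds how many terms each multiplier can handle. The sketch below restates that arithmetic under the specifications given in the examples; the function name `array_capacity` and its parameter names are hypothetical.

```python
def array_capacity(power_budget_w, area_budget, data_time_ns,
                   unit_power_w, unit_area, cycle_ns):
    """Return (max multipliers, passes per data time, max terms per data time)."""
    by_power = power_budget_w // unit_power_w   # multipliers allowed by the power budget
    by_area = area_budget // unit_area          # multipliers allowed by the area budget
    max_units = int(min(by_power, by_area))
    passes = data_time_ns // cycle_ns           # whole cycles that fit in one data time
    return max_units, passes, max_units * passes

# Budgets common to all three examples: 10 watts, 20 square units, 1 microsecond.
example_1 = array_capacity(10, 20, 1000, unit_power_w=1.0, unit_area=1.5, cycle_ns=600)
example_2 = array_capacity(10, 20, 1000, unit_power_w=1.25, unit_area=1.75, cycle_ns=450)
example_3 = array_capacity(10, 20, 1000, unit_power_w=0.5, unit_area=1.0, cycle_ns=900)
print(example_1, example_2, example_3)
```

In each example the power budget, not the area budget, is the binding constraint on the multiplier count.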
A final look at the two cases reveals that by utilization of accel-
erated components, multiple cycles processing can result in improved 
performance, while retarded cycle time components similarly will result 
in expanded capabilities. These two cases illustrate some limits to the 
defined array processor design criteria and imply that efficiency is a 
function of the relationship of the machine array to the problem array. 
Limits on Vector, Matrix Cycle Time 
The initial problem placed upon the design consists of an execution
time stipulation indicating when the results of the vector, matrix product
will be completed. In this specific situation M x N multiplications
and M(N-1) additions must be achieved. It has been established that techniques
for the reduction of the number of operations of addition and multiplication
are nonexistent. The only logical approach capable of improving
the overall speed of the calculation is to reduce the time
delay of adder and multiplier components. Knowledge of the limitations
of addition and multiplication logic will serve to establish the limits
on the size of the problem acceptable to a hardware processor.
Winograd deduced the theoretical lower limits of the multiplier (19) 
and addition (20) process. In doing so, an (r,d) circuit is utilized to 
express the terms of the bound of a d-valued logic circuit possessing 
elemental fan-in of at most r and having the capacity to compute any r 
argument, d-valued logic function in a unit interval. The addition of 
two N bit operands approaches lower limitation in the binary number sys-
tem in compliance with the equation 
$t \ge \lceil \log_r 2N \rceil$.
Winograd further illustrated that the theoretical lower limitation on 
multiplication delay is consistent with or slightly swifter than the 
addition limitation equation. The equation is of the form 
$t \ge \lceil \log_r 2(N - 2) \rceil$.
Through the correlation of practical circuit delays with the theoretical 
limitations, insight in practical design can be acquired. The most pre-
valent technique for the realization of addition is carry-look ahead with 
a speed exemplified by 
$S = 4 \lceil \log_r N \rceil$.
In multiplication, the fastest practical realization implements multiplier
encoding techniques combined with a Wallace tree interface of carry-save
adders and culminates in a speed of
$S = 2 \lceil \log_{3/2} N \rceil + 2 \lceil \log_r N \rceil$.
A numerical illustration of the performance equations is produced
by letting r = 4 and N = 16 bits. With this assumption the lower limits of
addition and multiplication can be calculated to be 3 gate delays, whereas
the carry-look-ahead adder realization possesses 8 gate delays in contrast
to 18 gate delays for multiplication as seen in present designs.
Existing adder logic more closely approaches the theoretical limitation
than does multiplication due to the ideal implementation of addition in
the binary number system. The cardinal rule emanating from the investigation
of addition and multiplication limitations reveals that a data
medium that permits the lower bound of addition will disregard the lower
bound of multiplication and vice versa. This postulate is exemplified
by the slide rule's capability to compute multiplication with logarithms
while being inept in addition. A similar example is the ROM look-up
table calculation media. The ROM system is an inefficient data representation
for both multiplication and addition, and in practice it provides
comparable access time for both operations.
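The gate delay figures quoted for r = 4 and N = 16 can be reproduced with exact integer arithmetic. The sketch below is illustrative only. It reads the square brackets in the equations as ceilings, and it takes the multiplication lower bound as the ceiling of log_r 2(N - 2); that form is an assumption, chosen because it reproduces the 3 gate delays stated in the text. The helper names `ceil_log` and `delay_figures` are hypothetical.

```python
def ceil_log(x, base_num, base_den=1):
    """Smallest integer k >= 0 with (base_num / base_den) ** k >= x, computed exactly."""
    k = 0
    while base_num ** k < x * base_den ** k:
        k += 1
    return k

def delay_figures(r, n):
    """Gate delay estimates for fan-in r and n-bit operands."""
    return {
        "add_lower_bound": ceil_log(2 * n, r),
        "mul_lower_bound": ceil_log(2 * (n - 2), r),        # assumed form of the bound
        "carry_look_ahead_add": 4 * ceil_log(n, r),
        "wallace_tree_mul": 2 * ceil_log(n, 3, 2) + 2 * ceil_log(n, r),
    }

figures = delay_figures(r=4, n=16)
print(figures)
```

Integer comparison is used instead of floating point logarithms so that exact powers, such as 16 = 4^2, are not rounded up by mistake.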
Hardware Monolithic Multiplier Power 
and Size Problems 
The three most critical features of a multiplier are speed of multiplication,
power dissipation, and binary word length. Word length requirements
are a function of data accuracy requirements, stipulated by
the overall problem (21). In practical applications the number of bits
of accuracy implemented should correspond to common existing word lengths
of 4, 8, 12, 16, and 24 bits. The word length indicates the bit length
of each operand and half the bit length of the product. If the number of
bits required exceeds the common word lengths, then cascaded configurations
of multiple multipliers will be interconnected to generate the
resultant product. This implies that a fluctuating word length creates
a direct alteration upon circuit size parameters in the circuit realization
capable of solving a given problem. A secondary concern generated
by the word length is the need for some knowledge as to the
interconnection capabilities of the hardware selected for the implementation
of a large word length processor (22). Some monolithic units exhibit
adequate speed and power characteristics but require numerous support
units to handle expanded word lengths. This will result in an unfavorable
circuit size characteristic that violates the circuit area constraint.
The present power dissipation of monolithic multipliers ranges from
300 milliwatts on a 2 x 4 multiplier to 5 watts on a 16 x 16 multiplier.
At the 5 watt power dissipation level, induction cooling techniques are
employed since the wattage is well over practical upper bounds for a
monolithic substrate. The thermodynamic equation that indicates the
power capabilities of a monolithic circuit is
$T(\text{junction}) = T(\text{ambient}) + \theta_{JA} \cdot (\text{power})$.
T(junction) is the temperature of the silicon chip, T(ambient) is the
still-air ambient temperature, $\theta_{JA}$ is the thermal resistance of the
package, and the last term is the power dissipation. Thermal resistance
is expressed as A °C per watt; that is, one watt of heat energy raises
the temperature by A °C. A typical
thermal characteristic of a 40 pin chip is in the range of 30°C per watt.
If the ambient temperature, as specified by standard military requirements,
is at a maximum of 125°C, and the limit on silicon junctions is
175°C, then the break point on power dissipation is calculated by
$\text{max power} = (T(\text{junction}) - T(\text{ambient})) / \theta_{JA}$
$\text{max power} = (175 - 125)/30 \approx 1.6$ watts.
This is a textbook solution to power dissipation limitations of a silicon
substrate and serves to show that one watt is approximately the
maximum power dissipation of an LSI device. To handle power dissipation
in the range of 5 watts per substrate, the DIP must be constructed with
fins or other heat sinks to diminish thermal resistance to around 10°C
per watt, as shown in Figure 10. LSI manufacturers are constrained primarily by
cost and are reluctant to exceed the 1.6 watt power dissipation per chip
(23).
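The thermal limit above is a one-line calculation, sketched here for the two package cases mentioned in the text. The function name `max_power_watts` and its parameter names are hypothetical; the temperatures and thermal resistances are those quoted in the paragraph.

```python
def max_power_watts(t_junction_c, t_ambient_c, theta_ja_c_per_w):
    """Maximum dissipation allowed by the junction limit: (Tj - Ta) / theta_JA."""
    return (t_junction_c - t_ambient_c) / theta_ja_c_per_w

# Standard 40 pin package at about 30 degrees C per watt.
plain_package = max_power_watts(175, 125, 30)
# Package with fins or other heat sinks at about 10 degrees C per watt.
heat_sink_package = max_power_watts(175, 125, 10)

print(f"{plain_package:.2f} W, {heat_sink_package:.2f} W")
```

With the thermal resistance reduced to 10°C per watt, the same 50°C margin supports the 5 watt dissipation of a 16 x 16 multiplier.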
Figure 10. Heat Sink Dip for LSI Device
High-speed multiplication requires special combinatorial algorithms 
that simultaneously form partial products and add them in one operation. 
Each sequential bit in the partial product is determined by an AND opera-
tion of successive multiplicand bits with a single multiplier bit. This 
is analogous to the add-and-shift technique, since a zero state multiplier
bit produces a zero partial product and a one state multiplier bit simply
duplicates the multiplicand in the partial product. The equivalent
shift operation in a logical multiplier is consummated by the intercon-
nections of the logical adders utilized to sum the partials (24). 
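The partial product scheme described above can be modelled in software. The following is a behavioral sketch for unsigned operands, not a description of any particular multiplier chip; the function name `multiply_by_partial_products` is hypothetical. Each multiplier bit is ANDed with the whole multiplicand, and the shift that hardware obtains from adder interconnection is written as an explicit left shift.

```python
def multiply_by_partial_products(multiplicand, multiplier, bits=8):
    """Unsigned multiply built from AND-generated partial products."""
    limit = 1 << bits
    if not (0 <= multiplicand < limit and 0 <= multiplier < limit):
        raise ValueError("operands must be unsigned values that fit in the word length")
    product = 0
    for i in range(bits):
        y_bit = (multiplier >> i) & 1
        # A zero multiplier bit gives a zero partial product;
        # a one multiplier bit copies the multiplicand.
        partial = multiplicand if y_bit else 0
        product += partial << i        # weight of bit position i
    return product

print(multiply_by_partial_products(200, 45))
```

A combinatorial multiplier forms all of these partial products at once and sums them in an adder array, whereas this loop forms them one at a time.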
A practical procedure for measuring the speed of a multiplication 
unit is to evaluate the speed as a function of logic-gate propagation 
delay, while the power dissipation is a function of the total number of 
gates. Referring to Figure 11, the propagation delay of this 8 bit multiplier
algorithm is affected by the delay from A1 to B7 to S15. This
route is composed of 14 adder units with 4 gate delays each, for 56 gate
delays, with another gate delay for the generation of partial products.
The total gate count is composed of 64 AND gates for the generation
of partial products and of 56 binary adders, individually realized
from 10 logic gates. The total gate count for an 8 bit combinatorial multiplier
system that implements this gate structure is 624 gates (25).
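The 8 bit figures above can be generalized to other word lengths with a small estimate. The formulas used here are assumptions chosen to reproduce the numbers in the text for 8 bits: n^2 AND gates, n(n-1) adders (56 for n = 8), and 2(n-1) adders on the longest path (14 for n = 8). The function name `array_multiplier_estimate` is hypothetical.

```python
def array_multiplier_estimate(bits, gates_per_adder=10, delays_per_adder=4):
    """Gate count and longest-path delay for an n-bit combinatorial array multiplier."""
    and_gates = bits * bits              # one AND gate per partial product bit
    adders = bits * (bits - 1)           # assumed adder count: 56 for 8 bits
    path_adders = 2 * (bits - 1)         # assumed longest path: 14 adders for 8 bits
    adder_delay = path_adders * delays_per_adder
    return {
        "gate_count": and_gates + adders * gates_per_adder,
        "adder_path_delay": adder_delay,
        "total_delay": adder_delay + 1,  # one more gate delay for the AND stage
    }

estimate = array_multiplier_estimate(8)
print(estimate)
```

Under these assumptions the gate count grows roughly with the square of the word length, which is the reason power dissipation limits the word length per chip.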
A combinatorial multiplier is significantly faster than a sequential-type
system, but vast improvements are necessary to approach the theoretical
lower bound of operation. For instance, the carry bits that are
transferred between the adders impede the process, and an alternate scheme
called carry-look-ahead will account for the bits without addition, and
will improve the delay time. Improvement schemes of this nature complicate
the structure and increase the gate count as a byproduct to their
benefit of increased speed. The larger the gate count per processed bit,
the smaller the word length possible per chip, if typical power dissipation
limits are upheld.
Figure 11. Eight Bit Multiply Algorithm
Techniques for Reducing Multiplier Delays 
Alternative techniques have been aimed towards reduction of both
gate count and gate delay. One such implementation is designated as the
modified Wallace Tree, and succeeds in enhancing the algorithm by saving
all carry bits and adding them in one step using triple input adders. A
circuit reduction of 24 gates is obtained over the carry-look-ahead scheme
by using the modified Wallace Tree. The drawback to both methods is that
additional logic is necessary to handle signed numbers. Further research
has revealed techniques for accommodating the sign convention and reducing
gate complexity through the assistance of encoding practices such as
the modified Booth's algorithm. In implementation, this algorithm requires
675 gates to process an eight bit multiplier and multiplicand (26).
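The encoding idea behind the modified Booth's algorithm can be sketched in software. The following is a behavioral sketch of the usual radix-4 recoding, not the gate level implementation cited in (26): the signed multiplier is recoded into digits from the set {-2, -1, 0, 1, 2}, which halves the number of partial products and handles two's-complement signs directly. The function names are hypothetical.

```python
def booth_radix4_digits(y, bits=8):
    """Recode a signed two's-complement multiplier into radix-4 Booth digits (LSD first)."""
    if bits % 2:
        raise ValueError("word length must be even for radix-4 recoding")
    pattern = y & ((1 << bits) - 1)      # two's-complement bit pattern of y
    digits = []
    prev = 0                             # implicit bit to the right of bit 0
    for i in range(0, bits, 2):
        b0 = (pattern >> i) & 1
        b1 = (pattern >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)   # digit value in {-2, -1, 0, 1, 2}
        prev = b1
    return digits

def booth_multiply(x, y, bits=8):
    """Signed multiply using one partial product per radix-4 digit of y."""
    return sum(d * x * 4 ** k for k, d in enumerate(booth_radix4_digits(y, bits)))

print(booth_radix4_digits(-1), booth_multiply(-37, 93))
```

An 8 bit multiplier yields four recoded digits, so only four partial products need to be summed instead of eight.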
In the implementation of a multiplication algorithm, several tech-
niques can contribute to a reduction in power dissipation on the sub-
strate. First, computer-aided circuit design studies are employed to 
determine noncritical data paths. These noncritical routes are accept-
ably realized with slower gates that dissipate less energy. Second, the 
number of devices required for AND-OR-INVERT gates in the multiplexer sec-
tion of the circuit can be optimized. Logic functions in some cases can 
be realized with single transistors, such as the equation C=A·B. This 
particular equality is realizable utilizing the collector, base, and 
emitter of one transistor. A subsequent enhancement scheme is to match 
the input threshold voltages of sequential states to reduce the need for 
translator circuits in the system. 
Extensive research into LSI design is yielding higher speeds of circuit
operation at comparable power levels. These advances will increase the
size of the problem solvable in the same time frame as before, but the
power and size constraints remain. This is exemplified by the latest
experimental C-MOS and a new bipolar stepped electrode process by
Hasashino (Nippon T&T subsidiary) which demonstrates a 0.5 psec
propagation delay with a 0.1 pJ power delay product. This result
constitutes an order of magnitude improvement in propagation delay over
D-MOS and V-MOS, while still maintaining the power delay product (27).
It has been implied that the end of the bounding improvements for
photo-mask-generated LSI circuits will occur as the line width approaches
the wavelength of visible light. Before LSI densities can mature further,
techniques must be conceived which allow a line width reduction to below
1 micron. Electron beam lithography (EBL) exemplifies such a capacity by
projecting integrated circuit patterns directly, without the aid of masks
and contact printing of the substrate. Recent advances in EBL have led to
the expectation that this technology will mature rapidly within the next
few years and overcome the present problems, particularly the constraint
of high cost. EBL-generated transistors will exhibit lower power
densities, due to their smaller physical size. If gate complexity is
escalated tenfold and chip size by 4, the resulting dimension of the
substrate will reach 12mm x 12mm, and the one million elements per chip
level can be approached. The realizable number of devices per chip will
ultimately be limited by power dissipation and the number of input and
output lines required for suitable applications. Generally speaking, as
the logic function implemented on a chip becomes more complex, it becomes
more specialized, fewer total devices are utilized, development costs
inflate, and the task becomes uneconomical (28).
The conclusions reached by an analysis of monolithic multipliers can
be cataloged as follows:
1. The number of gates necessary to implement a multiplier algorithm
is roughly constant from one algorithm to the next.
2. If the gate requirement to implement an 8 bit multiplier unit
is assumed to be in the range of 650 gates, then the primary consideration
as to the power and speed of the unit is the type of circuit technology
utilized.
3. It can be stipulated that to increase speed, an increase in
power dissipation is required.
4. Regardless of recent innovations in technology, power, size, and
cost constraints are still applicable.
Multiple Phase Processing 
As the technology develops and processors become less expensive, 
faster, and less power-consuming, the design constraints of speed, power, 
and circuit size will still exist. This has been evident throughout the 
transition from tubes to transistors and transistors to LSI circuits. 
With all the innovations in speed, power, and size over the last several 
decades, problem sizes have increased, requiring further consideration 
to the speed, power, and size dilemma. 
In the realm of vector, matrix operations, the effects caused by
design constraints are functions of the problem's characteristics. If
a 3 x 3 matrix and a 3 x 1 vector are multiplied together, 9 multiplication
operations and 6 additions must be performed before the result is
available. The cycle time of a processor is the time necessary to
complete one computation, and the completion time is the total time
needed to complete the solution. The utilization of four processors to
compute the product of a 3 x 3 matrix and a 3 x 1 vector will result in
three cycle time intervals of the processors, assuming that the cycle
time is one-third or less of the total computation time. During the
first cycle, four terms are computed and retained; during the second
cycle, four more terms are evaluated. The last cycle will contribute
only one term to the partial terms necessary to complete the solution.
During the closing iteration, all processors will be operating
but only one will be doing useful work.
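The idle time in the closing cycle can be tallied with a short sketch.
This is a minimal illustration in Python and not part of the original
design software; the function name uniform_array_schedule is hypothetical,
and the sketch assumes every processor completes exactly one term per
cycle.

```python
import math

def uniform_array_schedule(terms, processors):
    """Cycles, useful terms per cycle, and utilization for identical processors.

    Assumption: each processor finishes exactly one term per cycle.
    """
    cycles = math.ceil(terms / processors)
    # Useful terms completed in each cycle; the last cycle may be partly idle.
    useful = [min(processors, terms - c * processors) for c in range(cycles)]
    utilization = terms / (cycles * processors)
    return cycles, useful, utilization

# The 3 x 3 matrix times 3 x 1 vector case: 9 multiplications, 4 processors.
print(uniform_array_schedule(9, 4))
```

For the 3 x 3 example this yields three cycles with four, four, and one
useful term, so one quarter of the available processor cycles go unused.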
An alternate scheme for evaluation of this vector, matrix product
is to bring into operation two or more processors of dissimilar
cycle times and by so doing, alter the time, power, and size variables.
Such a processing unit will be designated as a multiple phased array
processor. This design conforms to the array processor design criteria,
as each unit that is affiliated with a particular cycle time will be
executing the same instructions concurrently on different data streams.
This process is equivalent to operating several array processors of
different speeds in parallel to improve the performance of the computer
and meet the constraints of time, power, and size. Adopting this multiple
phased array concept, in conjunction with the knowledge that power is
directly related to speed and inversely related to size, creates a
situation to which optimization techniques can be applied. In the
discussion of the matrix, vector multiplication problem using a matrix of
dimension 3 x 3, a solution could have been obtained in the required time
frame by computing four terms with one fast processor concurrent with the
computation of three terms using a slower processor unit. The remaining
two terms are evaluated in the same time frame by still another processor
of even slower speed that is capable of only two cycles in the time
required for the fastest processor to perform four complete operations.
Through the utilization of such a scheme of computing, the design problem
can be solved such that the cost is minimized and constrained by time,
power, and circuit size.
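The four, three, and two term split can be reproduced with a small
sketch. The cycle times below are hypothetical values in arbitrary time
units, chosen only so that the three processors complete four, three,
and two cycles in the time frame; the function name multiphase_capacity
is likewise an assumption.

```python
def multiphase_capacity(time_frame, cycle_times):
    """Terms each processor can finish in the time frame, and their total."""
    per_processor = [time_frame // t for t in cycle_times]
    return per_processor, sum(per_processor)

# Hypothetical cycle times (arbitrary units): fast, slower, slowest.
per_processor, total = multiphase_capacity(time_frame=12, cycle_times=[3, 4, 6])
print(per_processor, total)
```

With these assumed values the three processors cover 4 + 3 + 2 = 9 terms,
which is the full 3 x 3 problem, with no idle cycle in the time frame.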
Optimal Design With Linear Programming 
A linear program is defined as a mathematical model which is designed
to obtain a set of nonnegative numbers or variables which maximize
or minimize a linear equation or objective function while satisfying a
system of linear constraints. It is apparent in this situation that the
linear formulation must consist of and result in an integer solution.
Utilizing matrix notation, an integer program is exemplified as
follows:

Minimize:
    $$Z = \sum_{i} C_i P_i \qquad (4.1)$$
Subject to:
    $$\sum_{i} A_i P_i \;\{\le, \ge\}\; B \qquad (4.2)$$

where
    $P_i$, i = 1, 2, 3, ..., is an integer,
    $C_i$, i = 1, 2, 3, ..., is a cost term,
    $A_i$, i = 1, 2, 3, ..., is an N x 1 vector,
    $B$ is an N x 1 vector of constraints (or simply, the right-hand side),
and the $P_i$, i = 1, 2, 3, ..., together form a vector of integers.
In this application the objective function (Equation (4.1)) is the total 
cost of the array of processors used by the computer to process the 
matrix. Equation (4.2) is composed of three or more constraint equations. 
To evaluate the optimal design of the multiphase array processor, 
certain data on each processor must be obtained. 
1. Cost of each type of processor considered. 
2. Time necessary to complete one computation. 
3. Power (in watts) used to operate each type of processor.
4. Number of packages that compose each processor and number of 
pins used on each package. 
Once the hardware is acquired, the linear program equations are
subsequently created. Let $C_i$ = cost of processor $P_i$, i = 1, 2, 3, ..., N.

    $$C_1 P_1 + C_2 P_2 + \cdots + C_N P_N = Z$$
    $$C_i \ge C_{i+1}, \quad i = 1, 2, 3, \ldots, N.$$

Let $T_c$ equal the total time allowed for matrix computations and
$T_{pi}$ equal the cycle time of each processor $P_i$,

    $$T_i = \text{largest integer}\,(T_c / T_{pi}), \quad i = 1, 2, 3, \ldots$$

The time equation will be in the following form:

    $$T_1 P_1 + T_2 P_2 + \cdots + T_N P_N \ge \text{Size}.$$

Let $P'_i$ equal the aggregate of processors of type $P_i$ necessary to
compute the problem if only type $P_i$ processors are employed.

    $$P'_i = (\text{number of elements in matrix}) / T_i.$$
Let $W_i$ equal the power in watts necessary to operate each unit $P_i$.
The power constraint equation is formed as follows:

    $$W_1 P_1 + W_2 P_2 + \cdots + W_N P_N \le W_T$$

where $W_T$ represents the cumulative power sanctioned for consumption by
the array hardware. $W_T$ possesses an upper and lower bound that are
computed using $P'_i$. The upper limit is obtained from the largest term
of the limit equation

    $$W_i P'_i = W'_i, \quad i = 1, 2, 3, \ldots$$

The lower limit is the smallest value $W'_i$, as acquired from the
calculations. These upper and lower boundaries bracket the possible power
range that encompasses the choice of $W_T$.
The area equation is attainable by scaling each processor as to the
quantity of square units it requires on the circuit board with allowance
made for the bus structure, power, and circuit board configuration. Let
$U_i$ equal the necessary units for implementation of processor $P_i$. The
subsequent equation will be

    $$U_1 P_1 + U_2 P_2 + \cdots + U_N P_N \le U_T.$$

The right-hand side, $U_T$, is the total circuit real estate allocated for
the array structure hardware and bus system. The limits on the range of
$U_T$ are obtained from the $U'_i$ terms generated from the equation

    $$U'_i = U_i P'_i, \quad i = 1, 2, 3, \ldots$$

The largest and smallest terms of $U'_i$ provide the upper and lower
limits for the value of $U_T$.
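The parameter and limit calculations above can be collected into one
short routine. This is a sketch under stated assumptions rather than the
program of the appendices: the function name design_limits and the sample
data are hypothetical, and $P'_i$ is rounded up to a whole number of
processors, whereas the text states the plain quotient.

```python
import math

def design_limits(size, t_c, cycle_times, watts, areas):
    """Compute T_i, P'_i, and the bracketing limits on W_T and U_T.

    Assumptions: every cycle time is no larger than t_c, and P'_i is
    rounded up so that it is a whole number of processors.
    """
    t = [t_c // t_p for t_p in cycle_times]               # T_i
    p_only = [math.ceil(size / t_i) for t_i in t]          # P'_i
    w_limits = [w * p for w, p in zip(watts, p_only)]      # W'_i = W_i P'_i
    u_limits = [u * p for u, p in zip(areas, p_only)]      # U'_i = U_i P'_i
    return {
        "T": t,
        "P_only": p_only,
        "W_range": (min(w_limits), max(w_limits)),
        "U_range": (min(u_limits), max(u_limits)),
    }

# Hypothetical data: 30 terms, 1000 ns allowed, cycle times of 500 and 1000 ns,
# unit power and unit area for both processor types.
limits = design_limits(size=30, t_c=1000, cycle_times=[500, 1000],
                       watts=[1, 1], areas=[1, 1])
print(limits)
```

With these assumed inputs the routine gives T = [2, 1], P' = [15, 30], and
a range of 15 to 30 for both $W_T$ and $U_T$.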
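The resulting linear program is of the form:
Minimize:
    $$C_1 P_1 + C_2 P_2 + C_3 P_3 + \cdots + C_N P_N = Z$$
Constraints:
    $$T_1 P_1 + T_2 P_2 + T_3 P_3 + \cdots + T_N P_N \ge \text{Size}$$
    $$W_1 P_1 + W_2 P_2 + W_3 P_3 + \cdots + W_N P_N \le W_T$$
    $$U_1 P_1 + U_2 P_2 + U_3 P_3 + \cdots + U_N P_N \le U_T.$$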
The solution to the linear program will exist in a region bounded 
above the time line and below the power and area lines. Prior to attempt-
ing to obtain the optimal solution, the solution region should be examined 
to determine if it exists in such a state that will allow the existence of 
a feasible solution. At this point a reduction or increase of the solu-
tion region is achieved by altering the values of WT and UT. This capa-
bility will facilitate the search for the integer linear program solution 
by effectively reducing the search domain. 
The solution to the integer linear program is generated by using 
available computer software and computer systems. The technique is to 
use a branch and bound algorithm based on the Land and Doig (32) method. 
Details of the algorithm are covered in Appendix A. The end result of 
the linear program will be a circuit of a practical nature in an optimal 
form to solve a vector, matrix product computation. 
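For small problems the integer program can also be solved by plain
enumeration. The sketch below is not the Land and Doig branch and bound
method of Appendix A; it is a brute-force substitute that tests every
integer point up to the $P'_i$ limits. The function name
solve_by_enumeration and all numeric values, the costs in particular, are
hypothetical.

```python
from itertools import product

def solve_by_enumeration(costs, t, w, u, size, w_total, u_total, p_max):
    """Return (Z, (P_1, ..., P_N)) of minimum cost, or None if infeasible."""
    best = None
    for p in product(*(range(limit + 1) for limit in p_max)):
        time_ok = sum(ti * pi for ti, pi in zip(t, p)) >= size
        power_ok = sum(wi * pi for wi, pi in zip(w, p)) <= w_total
        area_ok = sum(ui * pi for ui, pi in zip(u, p)) <= u_total
        if not (time_ok and power_ok and area_ok):
            continue
        z = sum(ci * pi for ci, pi in zip(costs, p))
        if best is None or z < best[0]:
            best = (z, p)
    return best

# Hypothetical two-type data; the costs 5 and 2 are invented for illustration.
best = solve_by_enumeration(costs=[5, 2], t=[2, 1], w=[1, 1], u=[1, 1],
                            size=30, w_total=20, u_total=20, p_max=[15, 30])
print(best)
```

With these assumed values the minimum is Z = 70 at P1 = 10, P2 = 10: the
cheaper slow processors are preferred, but the power and area limits of 20
force ten fast units into the design. Enumeration grows exponentially with
the number of processor types, which is why a branch and bound method is
the practical choice for larger problems.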
Optimal Two Phase Design 
A graphic illustration of the parameter characteristics of the design
scheme is easily viewed in a two-dimensional problem. Assume the
matrix to be processed has 30 terms arranged in a 5 x 6 array. The
variable Size will be equated with the number of terms composing the array.

    Size = 30.

The total processing time for the array, $T_c$, is set at one microsecond.
This interpretation of $T_c$ implies that the entire array will be
processed in 1 microsecond or less, with the system ready to undertake the
next operation. From the stipulations on $T_c$, the selection of adequate
hardware can be resolved. First, the cycle times of each of the two types
of processors must fall below the $T_c$ value. Let the individual cycle
times of the two types of processor be specified as $T_{p1}$ and $T_{p2}$,
which results in time constraint parameters

    $$T_1 = \text{largest integer}\,(T_c / T_{p1})$$
    $$T_2 = \text{largest integer}\,(T_c / T_{p2}).$$

The consequence of the parameter values is depicted in the time
constraint equation

    $$T_1 P_1 + T_2 P_2 \ge \text{Size}.$$

The boundaries of the equation acquired in the form of $P'_1$ and $P'_2$ are

    $$P'_1 = \text{Size} / T_1$$
    $$P'_2 = \text{Size} / T_2.$$

The variables $P'_1$ and $P'_2$ each represent the quantity of processor
units essential to compute the solution if only units of type $P_1$ or
$P_2$ are employed in the design. $P'_1$ and $P'_2$ subsequently introduce
the limits on the maximum number of computing components fundamental to
the problem.
The power specifications of processors $P_1$ and $P_2$ are, respectively,
$W_1$ and $W_2$. This results in the subsequent power equation

    $$W_1 P_1 + W_2 P_2 \le W_T.$$
The limits on the power equation are generated by alternately zeroing
$P_1$, then $P_2$, and in each case computing the number of units
necessary to handle the problem. The limits are evaluated in terms of

    $$W'_1 = W_1 P'_1$$
    $$W'_2 = W_2 P'_2.$$

The larger of the two values $W'_1$ or $W'_2$ represents the upper limit
and the other equates to the lower limit. Between these limits the design
value of $W_T$ is obtained. The boundaries stipulate the practical range
of $W_T$ values that can be employed in a realizable design.
The circuit area constraint equation requires more in-depth
consideration prior to its evaluation. The appraisal of the circuit area
is a function of the circuit board construction technique employed, as
well as the bus structure utilized. The numerical quantities $U_1$ and
$U_2$ are indicative of the area required to fabricate one hardware unit
of type $P_1$ and one of type $P_2$. A prime consideration of the
fabrication technique should be the cost function connected with the
production of the circuit. Once the circuit area requirements are
determined, the area equation is generated in the form:

    $$U_1 P_1 + U_2 P_2 \le U_T.$$

With the use of $P'_1$ and $P'_2$, the limit constraints on the equation
are procured in a similar fashion as those of the power equation.

    $$U'_1 = U_1 P'_1$$
    $$U'_2 = U_2 P'_2.$$

The values of $U'_1$ and $U'_2$ will produce the boundaries bracketing the
selection of variable $U_T$. The resulting problem equation is written in
the form:
Minimize:
    $$C_1 P_1 + C_2 P_2 = Z$$
Constraints:
    $$T_1 P_1 + T_2 P_2 \ge \text{Size}$$
    $$W_1 P_1 + W_2 P_2 \le W_T$$
    $$U_1 P_1 + U_2 P_2 \le U_T.$$
Assume processors of type $P_1$ are capable of two cycles in the time
period $T_c$, while processors of type $P_2$ can execute only one cycle.
Also allow the power and unit area needs of both processors to be equal.
This will impose constraint equations of the form:

    $$2P_1 + 1P_2 \ge 30$$
    $$1P_1 + 1P_2 = X, \quad 15 \le X \le 30$$
    $$1P_1 + 1P_2 = Y, \quad 15 \le Y \le 30.$$
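The bounds of 15 and 30 on X and Y follow from the limit equations. A
short worked evaluation, taking the unit coefficients $W_1 = W_2 = 1$ and
$U_1 = U_2 = 1$ directly from the constraint equations:

```latex
\begin{gather*}
T_1 = 2, \qquad T_2 = 1, \qquad \text{Size} = 30 \\
P'_1 = \text{Size}/T_1 = 30/2 = 15, \qquad P'_2 = \text{Size}/T_2 = 30/1 = 30 \\
W'_1 = W_1 P'_1 = 15, \qquad W'_2 = W_2 P'_2 = 30 \\
U'_1 = U_1 P'_1 = 15, \qquad U'_2 = U_2 P'_2 = 30
\end{gather*}
```

The smaller value, 15, corresponds to a design built only from type $P_1$
units and the larger value, 30, to a design built only from type $P_2$
units.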
Figure 12 shows the time equation plotted in contrast to the upper
and lower limits of the power equation and Figure 13 shows the time line
and limits of the area equation. Different constraints on both power and
area will produce lines on the plot parallel to the boundary lines, such
as the lines shown in Figure 14. With the capability of moving the power
and area equation lines on the plot, it is possible to reduce the region
in which the solution will reside. Figure 14 illustrates a linear
program plot with the time, power, and area equations shown. The integer
solutions are also shown on the plot as they occur in the region above
the time line and below the power and area lines. The optimal solution
is obtained by testing each of the solutions in the cost equation and
determining which solution gives the minimum cost to the system. The
optimal solution can be computed either graphically for small problems
or by using the computer program discussed in Appendix A. Prior to the
use of the optimal program, a program of the type shown in Appendix B
can be used to plot the linear equations or just check the boundaries of
the constraint equations to assure that it is possible to obtain a
solution with the constraint values.

Figure 12. Time and Power Equations
(P1 type processors plotted against P2 type processors; TIME line
2P1 + 1P2 = 30; POWER (HIGH) and POWER (LOW) lines 1P1 + 1P2 = 30 and 15.)

Figure 13. Time and Area Equations
(P1 type processors plotted against P2 type processors; TIME line
2P1 + 1P2 = 30; AREA (HIGH) and AREA (LOW) lines 1P1 + 1P2 = 30 and 15.)

Figure 14. Integer Solutions of Problem
(P1 type processors plotted against P2 type processors; POWER LINE,
TIME LINE, and AREA LINE with the integer solutions marked.)
Summary 
The design characteristics of an array processor can be altered to 
produce a multi-phase form, capable of optimal operation. The concept 
is to utilize array processor components of different cycle times and 
effectively operate them in parallel in a SIMD environment. Using the 
operating characteristics of available hardware, a designer can formu-
late a set of linear equations, solvable with standard linear programming 
software, and generate a practical circuit configuration. The entire de-
sign package lends itself well to an interactive computer program. Soft-
ware of this nature would complement a designer's ability to make deci-
sions as to practical design capabilities of available hardware. A sum-
mary of the steps used to design the Multi-phase processor system and 
a sequence flow chart are given in Appendix E. 
CHAPTER V 
PROCESSING ELEMENT DESIGN 
Introduction 
Solution of the optimal linear program marks the culmination of the
overall circuit configuration problem. Subsequent design problems are
approached utilizing similarly defined constraints imposed on the overall
system. The SIMD definition stipulates that individual processors must
simultaneously operate on different data streams. In addition, the array
processor subclass definition further requires that all processing
elements perform identical operations concurrently on different data
streams under control of one instruction. Consequently, high speed data
paths are imperative between processor units. The cardinal directive of
the subclass is that all data fetched in one time frame is processed in
one time frame and stored in one time frame. These and other stipulations
that categorize array processor design must translate into design
criteria of the individual processor units that compose the array system.
A relationship between the structure exhibited by the problem and the
concept of the hardware design should exist analogous to the constraints
placed on the original design concept.
The Design of Processor Bus Structure 
For Vector, Matrix Products 
If the hardware structure internal to the processing unit is based
on the algorithm used, a study should be conducted to coordinate certain
steps in the algorithm with the implementation of hardware. Initially,
the sequence of the vector operations (multiplies and additions) should
be determined in an effort to postulate the maximum number of concurrent
operations that can co-exist in one time interval. The operations
fundamental to an N by (N+M) matrix will be on the order of N (N+M)
multiplications and N (N+M-1) additions. The dilemma is to determine in
what order the multiplications and additions should occur in order to
maximize the array capabilities of the system.
Certain ad hoc methods of studying the possible combinations of mul-
tiplications and addition schemes in a matrix, vector product will provide 
insight into the complexity of a combined algorithm hardware solution. 
These methods culminate in the realization that an overall analytical 
procedure is imperative to deduce a practical solution. A technique
suggested by Torng and Wilhelm (29) to optimize interconnections of
central processor registers suggests that the maximum number of data
lines is best determined by using linear dynamic programming methods.
The Torng
and Wilhelm technique is initiated by defining a transfer matrix consis-
ting of P x P elements, where each term of the matrix represents a transfer 
from one register to another. The matrix assists in charting concurrent 
and sequential operations between registers in a computer in order to 
determine the minimum bus requirements and maximum data flow. This con-
cept is expandable to an array system to maximize data flow and minimize 
the bus structure. 
To utilize this technique on a distributed architecture system that
is capable of solving a vector, matrix product requires the formation of
two transfer matrices. The first matrix (Figure 15) is established to
study the addition schemes. This matrix is composed of columns designated
by the terms in the vector of the vector, matrix problem. If the order
of the problem matrix is N by (N+M), then the first transfer matrix will
have N+M columns and N (N+M) rows. A check mark placed at the
intersection of a column and a row designates a required multiplication.
This table is constructed first since the multiplication must precede the
addition in the solution of the product.

Figure 15. Multiply Table
(Columns are labeled with the vector terms and rows with the matrix terms.)
Under the SIMD array processing criteria, all data values that are
processed in one interval must be fetched in one interval. It is evident
from the multiply table that if (N+M) elements of the vector are fetched
in one period, the N (N+M) multiplies could be accomplished in one time
interval. This is evident by scanning down the columns and noting the
total number of checks in each column that correspond to elements that
have been fetched. Note that to fetch the (N+M) vector elements in one
interval requires (N+M) bus systems to provide data transport from their
storage locations. Furthermore, to accomplish the parallel processing
requires N (N+M) multipliers and N (N+M) additional bus paths to fetch the
elements of the matrix in one time interval. An examination of the rows of
the multiply table shows that fetching any one element of the matrix will
result in only one possible operation. However, by looking down the
columns, it is evident that if one vector element is fetched, then N
multiplies are concurrently realizable by fetching the one vector element
and N matrix elements.
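The row and column properties of the multiply table can be checked with
a small sketch. The representation below (a list of Boolean rows) and the
function name multiply_table are assumptions made for illustration, not
the layout of Figure 15 itself.

```python
def multiply_table(n, m):
    """Build the multiply table for an N by (N+M) matrix.

    Rows are matrix elements (i, j); columns are vector terms. A True
    entry marks a required multiplication of A[i][j] by X[j].
    """
    cols = n + m
    row_labels = [(i, j) for i in range(n) for j in range(cols)]
    table = [[col == j for col in range(cols)] for (_, j) in row_labels]
    return row_labels, table

labels, table = multiply_table(n=2, m=1)
checks_per_row = [sum(row) for row in table]
checks_per_column = [sum(row[c] for row in table) for c in range(3)]
print(len(labels), checks_per_row, checks_per_column)
```

For N = 2 and M = 1 the table has N (N+M) = 6 rows and N+M = 3 columns;
every row carries one check and every column carries N = 2 checks, which
is the observation that one fetched vector element enables N concurrent
multiplies.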
The formation of the addition table is somewhat less routine, as the
terms must be separated into two groups of relatively equal size and
deposited on the extremities of the table with one group of terms on the
top and one group along the side. Check marks are placed on the grid
corresponding to terms that must be added, with the stipulation that
addition of a term occurs only once on the chart. From this matrix grid,
it is evident that only VAL number of additions may occur concurrently,
where

    VAL = largest integer [N (N+M)/2].

There are N (N+M-1) additions necessary with (N+M-1) addition
terms occurring in each row that must take place in pairs. Therefore,
[VAL1] designates the number of additions possible in one row in a single
time interval.

    [VAL1] = largest integer [N (N+M-1)/2].

The remainder (R) term of this integer division indicates an additional
adder interval requirement. From Figure 15, note that if column one of
the matrix is fetched, it will require N+1 bus paths, and the N products
consisting of terms produced by multiplying X1 times each of the column
terms are generated with N multipliers. The N partial terms are
concurrently stored and then N more are produced and added to the stored
terms, and so on across the row. This action constitutes a maximum
parallel operation of multiplication and addition with a minimum hardware
requirement. This process reduces the need for tree configured circuits
that reduce the parallel processing capability of the system. This result
is in agreement with Torng and Wilhelm (29), since they have pointed out
that the number of bus paths necessary is equal to the number of
simultaneous data transfers required.
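The column-by-column multiply and accumulate order can be sketched in
software. This is only a sequential model of the schedule, written under
the assumption that each loop pass stands for one time interval in which
one vector element and one matrix column (N+1 bus fetches) are consumed
by N multiply, add, and accumulate units; the function name
column_sweep_product is hypothetical.

```python
def column_sweep_product(a, x):
    """Matrix, vector product formed one matrix column per time interval."""
    n = len(a)
    acc = [0] * n                       # one accumulator per processing element
    for j, x_j in enumerate(x):
        # One interval: fetch x_j plus the N elements of column j.
        column = [a[i][j] for i in range(n)]
        # The N multiplies and N accumulations happen concurrently in hardware.
        acc = [s + a_ij * x_j for s, a_ij in zip(acc, column)]
    return acc

result = column_sweep_product([[1, 2, 3], [4, 5, 6]], [1, 0, 2])
print(result)
```

For the 2 by 3 example the accumulators end at [7, 16], the same values
as the ordinary row-by-row product, without any adder tree.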
This structure will require a multiply, add, and accumulate technique
for each element to be processed in parallel and N+1 bus paths used to
fetch data. Note that if the multiplier is fed from a magazine loader or
first-in-first-out buffer, the control and bus path complexity is
significantly reduced. To adhere to the requirement of array processing,
the N results accumulated in the scheme will require N bus paths to
return the data to memory concurrently as shown in Figure 16. This scheme
implements N+1 bus paths and the capability of storing the N new values
into memory synchronously. The memory realization is possible by
utilizing shift registers as the memory for the Z's and Y's, since they
need to shift one value of $Z_i$ onto the input bus in a sequential
action. This action will open the required area in memory to allow the
accumulated terms to be stored at the completion of a cycle. The A and K
memories will operate in a similar manner and can be implemented with a
shift register system of memories.
Design of Internal Processor Configuration 
The completion of the processor bus structure leads to the design of 
the internal interconnections of the individual processor units. The 
original design criteria carry over into the design of the internal hard-
ware of the processing elements. The concept is to maximize the through-
put in the processor unit by implementing the maximum parallel operation 
and data flow. 
Assuming the use of floating point two's complement numbers, it is
necessary to first define the unit necessary to perform a multiply, add,
and accumulate process on two signed binary numbers. The mantissas must
be multiplied and the exponents must be added to produce a floating
point product. This product must be shifted to the right or to the left,
and its exponent increased or decreased for each shift to make the
exponent match the exponent of the accumulated sum to which it will be
added. The standard structure to perform these operations is of the form
shown in Figure 17.

Figure 16. Circuit for the Parallel Storage of Data with Least Bus Lines
(Block diagram: four PROCESSOR units, each with an A & K memory, and a
Z and Y memory.)
Assuming the elements of the matrix are made available in normalized
two's complement floating point format prior to the calculations, a great
saving in design complexity can be obtained. The knowledge that all
numbers will arrive at the processor input in the same format allows the
renormalization and shifting of exponents to be accomplished in parallel
with the multiplication, thereby increasing the throughput by not waiting
until the product is formed to check its exponent and shift the mantissa
as required prior to addition. For example, a 32-bit floating point
number will have a 23-bit mantissa plus one sign bit and a 7-bit exponent
plus a sign bit. The multiplication of two 23-bit numbers will typically
require 250 ns, and the addition of two 8-bit numbers will require around
60 ns with present technology. Once the exponent of a new product is
formed, it is necessary to compare it to the accumulated exponent to
determine the number of right or left shifts necessary prior to adding
the new product to the accumulated sum. This will be done by subtracting
the new exponent and accumulated exponent and will require typically
another 70 ns. At this point, the number of shifts of the mantissa
necessary prior to addition to the accumulated total is known and
approximately 120 ns remains before the next mantissa becomes available.
Since the numbers going into the multipliers are both normalized, the
result of the multiplication will require at most one left shift to place
it in normalized form (17). It is possible to determine if the product
will require shifting after multiplication by analyzing the multiplier
and multiplicand prior to the multiplication. Prior knowledge of the
resulting position of the product, in terms of normalization, allows the
120 ns remaining before the product is available to be used to set a
shift network of multiplexers such that the output of the multiplier may
be fed into the adder without further delay. This will preclude the
normal procedure of latching the product, and checking and shifting it,
prior to the add and accumulate process. A similar method must be used
to place the accumulated sum into normalized form prior to storing it in
memory for use in later computations. It is possible to process the
accumulated data in the last adder stage with the use of look-ahead-carry
techniques which will adjust a shift network that places the data on
the bus in normalized form.

Figure 17. Floating Point Multiplier
(Block diagram: X DATA and Y DATA feed a MULTIPLIER; X EXP and Y EXP feed
an ADDER; a SHIFTER NETWORK with INC or DEC control, an ADDER, and an
ACCUMULATOR follow.)
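The 120 ns figure follows from the typical delays quoted in the text. The
symbol names below are introduced only for this worked check and do not
appear in the original:

```latex
\begin{align*}
t_{\text{multiply}} &\approx 250\ \text{ns} \\
t_{\text{exp add}} + t_{\text{exp compare}} &\approx 60\ \text{ns} + 70\ \text{ns} = 130\ \text{ns} \\
t_{\text{slack}} &= t_{\text{multiply}} - (t_{\text{exp add}} + t_{\text{exp compare}})
                 \approx 250\ \text{ns} - 130\ \text{ns} = 120\ \text{ns}
\end{align*}
```

The exponent path therefore finishes well inside the mantissa multiply
time, and the slack is what is available for presetting the shift network.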
Two problems remain to be faced in the processing units. First, the
size of the accumulator to hold the row products must be determined so
that an overflow of the accumulator will not occur. Second, the
accumulated element products must be rounded to the proper number of bits
to match the data bus and memory size.
The maximum size of any product term $A \cdot X_1$ is

    $$2^0 + \sum_{K=0}^{N-2} 2^{2N-1-K}$$

and there are N+M products in a row to be accumulated. In the summation,
N equals the number of bits in the multiplier or multiplicand, as well as
the size of the data bus.
The maximum size of the accumulator is then

    $$S = (N + M)\left(2^0 + \sum_{K=0}^{N-2} 2^{2N-1-K}\right).$$

The number of bits in the second term will be 2N (where N = the number of
bits in one data word). Therefore, the required number of bits of the
accumulator will be 2N plus the number of bits necessary to express the
term (N+M) in binary. For example, if there are 8 bits in a data word,
then there will be a maximum of 16 bits in the product, and if there are
5 elements in the matrix row, then (5 = 101) or 3 bits will be necessary
and the maximum size of the accumulator will be 19 bits. The need for
shifting the accumulator after each addition of a new partial product can
be avoided by allowing enough bits in the accumulator register to prevent
overflow under the worst case conditions. Once the terms are added in the
accumulator, they may be rounded after they are normalized. There are
two possible rounding methods commonly used: one is to make the
lowest bit of the bits to be kept a one; the other is to add the MSB of
the bits to be discarded to the bits to be retained. The fastest method
to process the data will be to carry guard bits to allow only truncation
or rounding at the interface to some other device which is driven by the
output of the computations.
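The accumulator sizing rule can be written as a one-line calculation and
checked against the worst case. The function name accumulator_bits is a
hypothetical label for this sketch; int.bit_length() gives the number of
bits needed to express the row length in binary.

```python
def accumulator_bits(word_bits, row_terms):
    """2N bits for the product plus the bits needed to express the row length."""
    return 2 * word_bits + row_terms.bit_length()

bits = accumulator_bits(word_bits=8, row_terms=5)
# Worst case: every product in the row is the all-ones value (2^N - 1)^2.
worst_case_sum = 5 * (2**8 - 1) ** 2
print(bits, worst_case_sum < 2**bits)
```

For an 8 bit data word and 5 row terms the rule gives 16 + 3 = 19 bits,
and the worst case sum of 325125 fits below 2^19 = 524288.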
Look-Ahead Shift to Increase Floating 
Point Operations 
Assuming that the input numbers to the floating point multiplier are
in two's complement normalized form, the operands of the product will
appear as

    $S_x . X_n X_{n-1} \cdots X_0$    Multiplicand
    $S_y . Y_n Y_{n-1} \cdots Y_0$    Multiplier.

By definition of normalization, $X_n$ and $Y_n$ will be equal to one and
$S_x$ and $S_y$ will be equal to zero for positive numbers. The largest
product possible from an N Bit Multiplicand and an N Bit Multiplier will be

    $$\text{Value max} = 2^0 + \sum_{K=0}^{N-2} 2^{2N-1-K}.$$

The form of this product is best illustrated by examples.

        11        111         1111          11111
      x 11      x 111       x 1111        x 11111
      1001     110001     11100001     1111000001

The $2^0$ bit is always set as well as the higher order N-1 bits, where N
is the number of bits in the multiplier or multiplicand. Note that these
examples result in positive normalized products and do not require a left
shift in any of the cases. Furthermore, as long as the numbers are
normalized prior to multiplication, they will at most only require one
shift to the left as a consequence of the normalization, which is
established when the MSB and the sign bit are not equal. This
normalization requirement is always true except for the binary numbers in
the form: S.XXX... = 1.1000...
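The closed form for the largest product can be checked against direct
multiplication of all-ones operands. The function names below are
hypothetical; the sketch only confirms the arithmetic of the examples.

```python
def max_product(n):
    """Product of two N bit all-ones magnitudes."""
    return (2**n - 1) ** 2

def value_max(n):
    """Closed form: 2^0 plus the sum of 2^(2N-1-K) for K = 0 .. N-2."""
    return 2**0 + sum(2 ** (2 * n - 1 - k) for k in range(0, n - 1))

for n in range(2, 6):
    print(n, format(max_product(n), "b"), max_product(n) == value_max(n))
```

For N = 2 through 5 this reproduces the bit patterns 1001, 110001,
11100001, and 1111000001 shown in the examples, and the two expressions
agree for each N.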
The question is how knowledge of the shift or no shift situation can
be obtained prior to the multiplication operation. There are four cases
that must be studied. Case 1 will consist of both X and Y being positive
normalized numbers with $X_1$ and $Y_1$ equal to one by definition of
normalization.
If $X_2$ and $Y_2$ are both 1 and all lesser bits are zero, the product
will be of the form $2^{2N-1} + 2^{2N-4}$. This result will be the
smallest product value resulting from the multiplication of two N bit
numbers having the two MSBs of each number set to ones. The results show
that a product of two positive normal binary numbers with $X_1$, $X_2$,
$Y_1$, $Y_2$ equal to one will never need a shift in the product to
normalize the result.
For Case 2, let $X_1 = Y_1 = 1$ by definition of normalization and let 
$X_2$ and $Y_2 = 0$. The largest product possible under these circumstances 
will be of the form 

    S.1 0 1 1 
  x S.1 0 1 1 

It is well known that the product of two N bit numbers will produce at 
most a 2N bit product. Also the product of two N bit numbers (where N = 4), 
which are in the form stated in this case, will result in a 2N bit product 
that will require normalization. A method must be determined to obtain 
the shift information from the multiplier and multiplicand in this case. 
This is accomplished by holding the multiplicand in the form 

    S.1 0 1 1 1 ... X_N 

and finding a multiplier of the form 

    S.1 0 X X X ... X_N 

that will just cause an overflow into the $2^{2N}$ bit position. The 
difference between these two numbers will be the range that must be tested 
by addition in the limiting case. For example, take two 6-bit numbers of 
the form S.1 0 1 1 1 1. 

    S.1 0 1 1 1 1   =     47 
  x S.1 0 1 1 1 1   =   x 47 
                        2209 

These two 6-bit numbers will produce at most a 12-bit product. 

    Bit          12    11   10    9    8   7   6   5  4  3  2  1 
    Max Value  2048  1024  512  256  128  64  32  16  8  4  2  1 

Note that if no shift is required, the product will exceed 2048 (12th 
bit position value). For a shift to be required, the product must be less 
than 2048. Holding the multiplicand at (S.1 0 1 1 1 1) = 47 and finding a 
number of the form (S.1 0 X X X X) that produces a product just below 2048 
determines the shift limit. In this case 43 will produce a product of 
2021 and 44 will produce a product of 2068. The limiting number that will 
require a shift will be S.1 0 1 0 1 1 = 43 and the difference between the 
multiplicand and the multiplier is 47 - 43 = 4. This indicates that the 
rest of the bits over the range of zero to four will determine if the 
shift is or is not necessary. The test can be made by addition and the 
test on the higher order bits $X_1$, $X_2$, and $Y_1$, $Y_2$ can be made by AND 
operations. Notice the form 

    S.1 0 1 1 1 1 
  x S.1 0 1 1 1 1 

If $X_2$ and $Y_2$ are zero, then $X_3$ and $Y_3$ and $X_4$ and $Y_4$ must be 1's to 
create a no shift condition. Bits $X_5$, $X_6$ and $Y_5$, $Y_6$ may change over a 
range of 4 and result in a no shift situation. Any other change in the 
last two bits in each number will result in a no shift condition. 

    S.1 0 1 1 X X 
  x S.1 0 1 1 X X 

Summary of Case 2 with $X_2$ and $Y_2$ equal to zero: for the form above, a 
shift is possible; any other combination results in no shift possible. 
For the sum of the bit pairs $X_5 X_6$ and $Y_5 Y_6$: 

    if sum > 4, no shift; 
    if sum ≤ 4, shift. 

In Case 3, the form is 

    S.1 0 X X              S.1 1 X X 
  x S.1 1 Y Y      or    x S.1 0 Y Y 

In this case, the shift limit is again $2^{2N-1}$, and in the same 6-bit 
example 

    S.1 0 0 0 0 0   =     32 
  x S.1 1 0 0 0 0   =   x 48 

By holding the 48 and adjusting the 32, the limit may be reached. For 
Case 2 it is found that since the add limit of the two numbers is 90, the 
multiply limit with 48 will be 42. To continue the same example with new 
form: 

    S.1 0 1 0 1 0   =     42 
  x S.1 1 0 0 0 0   =   x 48 

The base required form is S.1 0 and S.1 1 which correspond to 32 and 48, 
which sum to 80. This implies that the last four bits of each number must 
add to 10 or less to require a shift and if they are larger than 10, no 
shift is required. 
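The numerical limits quoted in Cases 1 through 3 can be checked with a short Python sketch. It is offered only as an illustration under stated assumptions: operands are treated as positive 6-bit magnitudes, the name `needs_shift` is hypothetical, and the predicate is evaluated from the full product rather than from the leading operand bits as the look-ahead logic would do.

```python
def needs_shift(x: int, y: int, n: int) -> bool:
    """True when the 2n-bit product of two n-bit magnitudes has its top
    bit clear, so one left shift is needed to renormalize."""
    return x * y < (1 << (2 * n - 1))


N = 6                                      # magnitude bits in the example
normalized = range(1 << (N - 1), 1 << N)   # 32 .. 63, MSB set

# Normalized operands never need more than one left shift.
at_most_one_shift = all(x * y >= (1 << (2 * N - 2))
                        for x in normalized for y in normalized)

# Case 1: both operands have their two MSBs set (48 .. 63).
case1 = range(0b110000, 1 << N)
case1_never_shifts = not any(needs_shift(x, y, N)
                             for x in case1 for y in case1)

# Limits quoted in the text for Case 2 (47 held) and Case 3 (48 held).
for x, y in [(47, 43), (47, 44), (48, 42), (48, 43)]:
    print(x, y, x * y, "shift" if needs_shift(x, y, N) else "no shift")
```

The four printed products straddle 2048, the value of the 12th bit position, which is the boundary the text uses to decide between shift and no shift.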
The negative case may be studied in a similar manner but the best 
approach to the overall problem is to only consider positive cases. The 
negative product obtained in a standard system is run through a two's 
complementer to get a positive product, then shifted to normal form and 
put back through the two's complementer to obtain a negative value in 
normalized form. The tests may all be run with positive values; then, 
knowing the shift results, a negative product may be shifted without 
being two's complemented and tested. This result can be utilized with 
the shift results of the exponents to gate the output of the multiplica-
tion into the accumulating adder with only a small gate delay from the 
shift multiplexers. 
Circuit Configuration with Look-Ahead Shift 
The basic components of the floating point processor are the multi-
plier, exponent adder, accumulator adder, accumulator, and look-ahead 
shift system. These components are shown in Figure 18. 
If the binary numbers X and Y are to be multiplied with this scheme, 
the mantissas and exponents are entered in the appropriate inputs. The 
look-ahead shift system determines if the result of the multiplication 
will require a shift for normalization and will complete its analysis long 
before the product is available. The two exponents must be added in a 
two's complement adder, and the result is compared with the exponent 
stored in the accumulator. The accumulator and new product exponents must 
match in magnitude before the addition and accumulation can take place. 
The results of the exponent addition will be available prior to the 
product of X and Y. Ample time is available to allow the exponent adjust 
system to adjust the shift multiplexer in coordination with the look-ahead 
shifter to preset the shift multiplexer and gate the product of X and Y 
into the adder in proper normalized form to allow addition and accumulation 
of the results. 
[Figure 18. Floating Point Multiplier with Look-Ahead Unit. Blocks shown: 
look-ahead shifter, multiplier, shift network, adder unit, exponent adder, 
exponent adjust system, zero detect unit, and accumulator unit.] 

At the onset of a new cycle, the accumulator is set to zero and the 
exponent adjust system is notified that a new cycle has begun. The first 
product will be added to the zero accumulator after the normalization 
shift requirement of the look-ahead shifter has been met. The remainder 
of the input data products to be accumulated will undergo normalization 
requirements generated by both the look-ahead shifter and exponent 
shifter. The conclusion of a row of the matrix can be determined by 
counting the number of inputs to the accumulators to determine if row 
calculations have been completed. This accumulation counting technique 
will be most acceptable to the system with the expansion to the optimal 
structure of processors. 
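The multiply path described above can be sketched in Python as follows. This is an illustrative model only, assuming positive operands, a 6-bit mantissa magnitude, and hypothetical function names; the shift decision is computed from the full product for brevity, whereas the look-ahead shifter would derive it from the leading operand bits before the product is available.

```python
from fractions import Fraction

N = 6  # assumed mantissa magnitude width


def value(mant: int, exp: int, bits: int) -> Fraction:
    """Value of a positive fraction 0.mant (bits wide) scaled by 2**exp."""
    return Fraction(mant, 1 << bits) * Fraction(2) ** exp


def multiply(mx: int, ex: int, my: int, ey: int, n: int = N):
    """Multiply two normalized positive operands and return a normalized
    2n-bit mantissa with its exponent."""
    shift = mx * my < (1 << (2 * n - 1))  # stand-in for the look-ahead result
    exp = ex + ey                         # exponent adder
    prod = mx * my                        # mantissa multiplier
    if shift:
        prod <<= 1                        # one left shift in the shift network
        exp -= 1                          # matching exponent decrement
    return prod, exp


p, e = multiply(47, 0, 43, 0)
print(p, e, value(p, e, 2 * N))
```

Because the shift is paired with an exponent decrement, the value of the product is unchanged while its mantissa is returned with the most significant bit set.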
Effects of Multiple Phase Operation on the 
Processing Element Design 
The addition of faster or slower processing elements or individual 
multiplier units to the array structure will serve to complicate the 
problems of addition and accumulation of row terms in the process. This 
situation arises since the individual terms of a row of the matrix and 
the column of the vector, once multiplied, must all be summed to produce 
a term in the resulting vector. The required addition of these terms 
provides a problem area in the search for an efficient, inexpensive row 
processing scheme. One possibility is to use an adder and accumulator 
for each multiplier unit and therefore treat the design of all the pro-
cessing elements as a standard structure. The hindrance in this scheme 
is that the accumulated data of the elements must further be added to 
produce the correct row values as shown in Figure 19. 
[Figure 19. Processor Circuit with Two Processor Elements. Each element 
contains a look-ahead shift unit, a multiplier, an exponent 
increment/decrement unit, and an adder/accumulator; a shifter, adder, and 
accumulator combine the element outputs into the row result.] 

This approach will require the first two accumulated values to be 
adjusted prior to addition to obtain the row value. Extra hardware is 
required to accomplish these tasks and extra time will elapse during 
these steps. An alternate approach is to combine the second multiplier 
into the first processing element while utilizing only one adder and 
accumulator to sum the row values in a time share mode, as shown in 
Figure 20. An efficient time use of the adder and accumulator is possible 
due to the difference in operating speeds of the multipliers in the 
processing elements. Variations of both these schemes can be used as the 
requirements of the optimal solution result in different processing 
element requirements. 
In the upper limiting case there exists one processor element for 
each term in the array, and a tree structure of adders may be required to 
compute the row values. As the requirements generated by the optimal 
solution change, resulting in slower processors being added to the structure, 
it becomes conceivable to combine fast and slow processors in a given 
processing element. With further alterations in the optimal solution, it 
may not be possible to combine the results of a fast and slow multiplier 
concurrent with the next multiply time and the need for additional adders 
will result. The limiting factor in the process of combining more than 
one multiplier and shift unit to operate in conjunction with one adder 
and accumulator will be the number of additions that can be accomplished 
during the multiply period. For example, if an entire row of processing 
multipliers is used, one for each term in a row of the matrix, it may be 
possible to use one adder and accumulator to combine the results of all 
the elements using a time share data collection technique. This is accep-
table provided that all the data terms can be accumulated in one multiply 
time. 
[Figure 20. Floating Point Multiplier with Interleaving Design] 
Summary 
The initial problem in the processor structure design is to deter-
mine the most feasible plan of calculating the terms of the matrix, vec-
tor problem, considering the problem structure. Certain techniques exist 
to facilitate the search for a practical solution to a minimal component 
processor configuration. Once the decision on a hardware structure inter-
nal to the processing elements is culminated, a further study into possi-
ble alterations caused by variations in the optimal solution of the 
processor configuration is of prime concern. The alterations generated 
by changing parameters in the problem requirements will result in possible 
complications or simplifications of the internal processor hardware. 
In adherence with the design constraints of an array processor struc-
ture, certain improvements can be provided to a floating point hardware 
structure to improve its parallel operation. Knowledge of the binary 
number format and the problem algorithm aids the removal of the binary 
number normalization process from a serial configuration in hardware. The 
normalization processes resulting from multiplication and from addition can 
be preset concurrently with the multiplication of two terms at the start 
of each new cycle. 
With the design constraints implemented in the structure and possible 
improvements to the circuitry generated, concern is shifted to the effects 
of existing hardware and the problems that standard parts might impose on 
the desired circuits. The effects of the multiple phase type structure 
and possible advantages and disadvantages of hardware were next considered 
and discussed. The final result of this chapter is that the design con-
straints and problem algorithm limitations have been carried through the 
system design process down to the lowest unit of the system. 
CHAPTER VI 
APPLICATION OF DISTRIBUTED ARCHITECTURE 
TO LINEAR RECURSIVE FILTERS 
Introduction 
One class of algorithm that employs a vector, matrix multiplication 
is the linear recursive filter, which is composed of both full and 
suboptimal forms. This class of algorithm lends itself well to the 
properties of distributed architecture exhibited through the use of an 
optimal hardware design and its subsequent flexible throughput capability. 
This chapter will illustrate how a full order Kalman filter can be 
constructed in an optimal distributed architecture system. 
Full Order Kalman Filter 
A common problem in data transfer systems is the recovery of a pulse 
signal that has been corrupted by noise and has been distorted by being 
passed through a linear network such as a transmission line. Let a(t) be 
an input signal that has been corrupted by outside noise signals and 
subsequently passed through a system having a transfer function of 
b/(S + b). The output signal y(t) is further corrupted by measurement 
noise W(t), resulting in a signal r(t) which is the observed output signal 
of the modeled system. The transfer function of the transmission line is 
denoted to include all influences caused by linear distortion acting upon 
the input signal a(t). 
The input signal a(t) will be a Manchester bi-phase pulse train of 
the type shown in Figure 21. This signal will be approximated with a 
Poisson-distributed-zero-crossing bi-level pulse train which has a well 
known autocorrelation function of the form $E_m^2 e^{-2K|\tau|}$, where 
$E_m$ is the peak amplitude of the signal. The choice of the Poisson signal 
will allow an analytical approach to determine the transfer function of a 
linear system that will produce an approximation of the Manchester 
bi-phase signal at its output when its input is white noise. The transfer 
functions of the signal generator and the transmission line are critical 
in the determination of the Kalman filter equations necessary to 
process the Manchester bi-phase data. Figure 21 also illustrates the 
system model to include the transfer function of the transmission line and 
the additive noise. The result of this system model will be a composite 
signal called r(t) that will provide the input to the Kalman filter. 

[Figure 21. System Model Diagrams. A generator driven by U(t) produces the 
waveform output a(t); the system model passes the signal through b/(s + b) 
with additive noise W(t).] 
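The power density spectrum of the random bi-level signal is given 
by 

$$S_a(\omega) = \frac{8 K E_m^2}{(2K)^2 + \omega^2},$$

where $\omega = 2\pi f$. 
The required linear system can be obtained by setting $S_a(\omega)$ equal to 
the product of its complex conjugate factors 

$$S_a(\omega) = 8 K E_m^2 \cdot \frac{1}{2K + j\omega} \cdot \frac{1}{2K - j\omega}.$$

The transfer function can be altered to arrive at the similar form 
$a/(s + a)$ by multiplying $S_a(\omega)$ by $4K^2/4K^2$. By defining 
$G^+(j\omega)$ as 

$$G^+(j\omega) = \frac{2K}{2K + j\omega},$$

which results in 

$$S_a(\omega) = \frac{2K}{2K + j\omega} \cdot \frac{2K}{2K - j\omega} \cdot \frac{2 E_m^2}{K},$$

the transfer function of the system can be determined. Utilizing this 
equation, the input power spectral density is computed by: 

$$S_{u1}(\omega) = S_a(\omega)\, \frac{(2K)^2 + \omega^2}{(2K)^2} = \frac{2 E_m^2}{K}.$$
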
The random signal a(t) can be represented mathematically by the linear 
model 

    U(t) --> [ 2K / (2K + S) ] --> a(t) 

The total system model takes on the form shown in Figure 22 and b/(S + b) 
is the transfer function of the linear transmission media through which 
the signal is transmitted. 
Kalman Filter Realization 
Kalman filtering is based on the assumption that any random process 
can be modeled with a system which passes white noise through a linear 
circuit. The filter equations can be solved in the discrete form and 
values of the gain matrix can be generated for each sample time to be 
used by the system until it reaches steady state. In systems with a fast 
time period, the steady state values can be used to form the gain matrix 
and reduce the normal memory requirement of the computer. 
The state variable form of the system will be: 

$$\dot{X} = AX + BU(t)$$
$$Z = CX + W(t).$$

U(t) and W(t) are assumed to be white noise processes such that: 

$$E[U(t) U^T(\tau)] = Q\,\delta(t - \tau)$$
$$E[W(t) W^T(\tau)] = R\,\delta(t - \tau).$$

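The state equations are determined to be: 

$$\begin{bmatrix} \dot{x}_1 \\ \dot{x}_2 \end{bmatrix} = \begin{bmatrix} -b & b \\ 0 & -2K \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} 0 \\ 2K \end{bmatrix} u_1(t)$$

$$z = \begin{bmatrix} 1 & 0 \end{bmatrix} x + W(t)$$

$$Q = \frac{2 E_m^2}{K}, \qquad R = \frac{N_0}{2}$$

[Figure 22. The Total Signal Generating Model Diagram] 
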
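The solution can be obtained from the equation 

$$X(t) = \Phi(t - t_0) X(t_0) + \int_{t_0}^{t} \Phi(t - \tau) B(\tau) U(\tau)\, d\tau.$$

For a discrete system let $t = t_{K+1}$ and $t_0 = t_K$. Therefore, 

$$X(t_{K+1}) = \Phi(t_{K+1} - t_K) X(t_K) + \int_{t_K}^{t_{K+1}} \Phi(t_{K+1} - \tau) B(\tau) U(\tau)\, d\tau.$$

If a constant sampling rate of $T = t_{K+1} - t_K$ is assumed, then 

$$X(K+1) = \Phi(T) X(K) + U(K),$$

where 

$$U(K) = \int_{t_K}^{t_{K+1}} \Phi(t_{K+1} - \tau) B U(\tau)\, d\tau.$$

The state transition matrix $\Phi(t)$ is found by letting 

$$\Phi(t) = \mathcal{L}^{-1}\left\{ [SI - A]^{-1} \right\},$$
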
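where 

$$[SI - A]^{-1} = \begin{bmatrix} \dfrac{1}{S + b} & \dfrac{b}{(S + 2K)(S + b)} \\ 0 & \dfrac{1}{S + 2K} \end{bmatrix}.$$

The transform of the (1,2) term is found as follows: 

$$\frac{b}{(S + 2K)(S + b)} = \frac{A}{S + 2K} + \frac{B}{S + b}$$

$$A = \lim_{S \to -2K} \frac{b}{S + b} = \frac{b}{b - 2K}$$

$$B = \lim_{S \to -b} \frac{b}{S + 2K} = \frac{b}{2K - b}$$

$$\mathcal{L}^{-1}\left\{ \frac{b}{b - 2K}\left[\frac{1}{S + 2K}\right] + \frac{b}{2K - b}\left[\frac{1}{S + b}\right] \right\} = \frac{b}{b - 2K}\, e^{-2Kt} + \frac{b}{2K - b}\, e^{-bt}$$

Therefore, the state transition matrix is: 

$$\Phi(t) = \begin{bmatrix} e^{-bt} & \dfrac{b}{b - 2K}\, e^{-2Kt} + \dfrac{b}{2K - b}\, e^{-bt} \\ 0 & e^{-2Kt} \end{bmatrix}$$
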
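As a numerical cross-check of the state transition matrix, the closed form can be compared with a truncated power series for the matrix exponential of At. The Python sketch below is illustrative only; the values of b, K, and t are hypothetical (the thesis does not give them here), and the closed form requires b ≠ 2K.

```python
import math


def phi_closed_form(t: float, b: float, k: float):
    """State transition matrix from the partial fraction result."""
    e_b = math.exp(-b * t)
    e_2k = math.exp(-2.0 * k * t)
    phi12 = b / (b - 2.0 * k) * e_2k + b / (2.0 * k - b) * e_b
    return [[e_b, phi12], [0.0, e_2k]]


def matmul(p, q):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(p[i][m] * q[m][j] for m in range(2)) for j in range(2)]
            for i in range(2)]


def expm_series(a, t: float, terms: int = 30):
    """exp(a*t) for a 2x2 matrix by summing the Taylor series."""
    result = [[1.0, 0.0], [0.0, 1.0]]
    term = [[1.0, 0.0], [0.0, 1.0]]
    for n in range(1, terms):
        scaled = [[a[i][j] * t / n for j in range(2)] for i in range(2)]
        term = matmul(term, scaled)          # now holds (a*t)^n / n!
        result = [[result[i][j] + term[i][j] for j in range(2)]
                  for i in range(2)]
    return result


b, k, t = 1.0, 2.0, 0.1                      # hypothetical parameter values
A = [[-b, b], [0.0, -2.0 * k]]
closed = phi_closed_form(t, b, k)
series = expm_series(A, t)
max_err = max(abs(closed[i][j] - series[i][j])
              for i in range(2) for j in range(2))
print(closed, max_err)
```

For these sample values the two results agree to within floating point rounding, which supports the signs of the two exponential terms in the (1,2) element.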
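The mean value of the white noise driver is 

$$E[U(K)] = \int_{t_K}^{t_{K+1}} \Phi(t_{K+1} - \tau) B\, E[U(\tau)]\, d\tau = 0$$

and the covariance is found as follows: 

$$E[U(K) U^T(K)] = E\left[ \int_{t_K}^{t_{K+1}} \Phi(t_{K+1} - \tau) B U(\tau)\, d\tau \left( \int_{t_K}^{t_{K+1}} \Phi(t_{K+1} - \sigma) B U(\sigma)\, d\sigma \right)^{T} \right]$$

$$= \int_{t_K}^{t_{K+1}} \Phi(t_{K+1} - \tau) B Q B^T \Phi^T(t_{K+1} - \tau)\, d\tau.$$

The matrix computations leading to the evaluation of the Q matrix can be 
seen in Appendix D. 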
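The discrete Kalman filter has a dynamic model of the form 

$$X_{j+1} = \Phi X_j + D U_j \quad \text{with } U_j : N(0, Q_j)$$

and the observation model is 

$$Z_j = H X_j + W_j \quad \text{with } W_j : N(0, R_j).$$

The algorithm for generating the discrete estimates for each sample time 
T is 

$$\hat{x}_{j+1} = \hat{x}_{j+1|j} + K_{j+1}\left[ Z_{j+1} - H \hat{x}_{j+1|j} \right],$$

where 

$$\hat{x}_{j+1|j} = \Phi \hat{x}_j \quad \text{and} \quad \hat{x}_0 = \mu_0.$$

The gain values for each estimate are derived from the equation 

$$K_{j+1} = P_{j+1|j} H^T \left[ H P_{j+1|j} H^T + R_{j+1} \right]^{-1},$$

where 

$$P_{j+1|j} = \Phi P_j \Phi^T + D Q_j D^T \quad \text{and} \quad P_{j+1} = \left[ I - K_{j+1} H \right] P_{j+1|j}.$$

The algorithm is altered to a form more applicable to the distributed 
system by combining like terms to form: 

$$\hat{x}_{j+1} = \left[ I - K_{j+1} H \right] \Phi \hat{x}_j + K_{j+1} Z_{j+1}.$$
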
The values of the gain at each update are computed in off line simulation 
and stored in the system memory along with the terms of the Q matrix. In 
some problems of this type the steady state gain values are used at each 
update which allows less memory to be used to store data constants. The 
gain values of the Kalman filter problem are listed in Appendix C and it 
is evident that the gains reach a steady state value of $K_j(1) = 0.215$ 
and $K_j(2) = 1.07$. Further details of the program coding and format are 
shown in Appendix C. 
Distributed Architecture Equation Format 
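The matrix equation for implementing the Kalman filter problem is of 
the form: 

$$\begin{bmatrix} \hat{x}_{j+1}(1) \\ \hat{x}_{j+1}(2) \end{bmatrix} = \left[ I - \begin{bmatrix} K_1 \\ K_2 \end{bmatrix} \begin{bmatrix} H_1 & H_2 \end{bmatrix} \right] \begin{bmatrix} \phi_{11} & \phi_{12} \\ \phi_{21} & \phi_{22} \end{bmatrix} \begin{bmatrix} \hat{x}_j(1) \\ \hat{x}_j(2) \end{bmatrix} + \begin{bmatrix} K_1 \\ K_2 \end{bmatrix} z_{j+1}$$

The value of the H vector is [1 0], indicating that this is a single 
observer filter. The terms of the equation can be further reduced to form 
the values of the constants to be stored in memory, 

$$\begin{bmatrix} \hat{x}_{j+1}(1) \\ \hat{x}_{j+1}(2) \end{bmatrix} = \begin{bmatrix} (1 - K_1)\phi_{11} & (1 - K_1)\phi_{12} \\ \phi_{21} - K_2 \phi_{11} & \phi_{22} - K_2 \phi_{12} \end{bmatrix} \begin{bmatrix} \hat{x}_j(1) \\ \hat{x}_j(2) \end{bmatrix} + \begin{bmatrix} K_1 \\ K_2 \end{bmatrix} z_{j+1}$$

and by combining the matrix equations into one matrix, vector 
multiplication, the proper form for implementation in hardware is reached. 

$$\begin{bmatrix} \hat{x}_{j+1}(1) \\ \hat{x}_{j+1}(2) \end{bmatrix} = \begin{bmatrix} (1 - K_1)\phi_{11} & (1 - K_1)\phi_{12} & K_1 \\ \phi_{21} - K_2 \phi_{11} & \phi_{22} - K_2 \phi_{12} & K_2 \end{bmatrix} \begin{bmatrix} \hat{x}_j(1) \\ \hat{x}_j(2) \\ z_{j+1} \end{bmatrix}$$
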
If the steady state gain values are used in a problem, the required 
memory for constants is reduced to the number of terms in the matrix. 
Such situations will further reduce the complexity of the design and 
directly affect the size, power, and cost of the system. In problems 
which do not lend themselves to this reduction process, a memory space 
is required to store a new set of values for each term in the matrix at 
each update calculation until steady state is reached. 
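The reduction to a single matrix, vector multiplication can be illustrated with a short Python sketch. It assumes H = [1 0] and the steady state gains quoted above; the transition matrix entries, state, and measurement are hypothetical values chosen only for the demonstration, and the function names are not from the thesis.

```python
def combined_matrix(phi, gains):
    """Build the 2x3 constant matrix [(I - K H) Phi | K] for H = [1 0]."""
    (p11, p12), (p21, p22) = phi
    k1, k2 = gains
    return [[(1.0 - k1) * p11, (1.0 - k1) * p12, k1],
            [p21 - k2 * p11, p22 - k2 * p12, k2]]


def update(matrix, x, z):
    """One filter update as a single matrix, vector product."""
    vector = (x[0], x[1], z)
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]


def two_step_update(phi, gains, x, z):
    """Reference form: predict, then correct with the innovation."""
    pred = [phi[0][0] * x[0] + phi[0][1] * x[1],
            phi[1][0] * x[0] + phi[1][1] * x[1]]
    innovation = z - pred[0]                 # H = [1 0] picks the first state
    return [pred[0] + gains[0] * innovation, pred[1] + gains[1] * innovation]


PHI = [[0.9, 0.08], [0.0, 0.7]]              # hypothetical transition matrix
GAINS = (0.215, 1.07)                        # steady state gains from the text
M = combined_matrix(PHI, GAINS)
print(update(M, [1.0, 0.5], 1.2))
print(two_step_update(PHI, GAINS, [1.0, 0.5], 1.2))
```

The combined matrix has six terms, two rows of three, so each row of the update is one three-term multiply and accumulate of the kind the processing elements are designed to perform.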
Linear Program Formulation 
The Kalman filter will be implemented with two types of processors, 
each with different operating characteristics. 

                   Cycle Time    Watts/Chip    Area/Chip 
    Processor 1    450 nsec      5 watts       3 units 
    Processor 2    800 nsec      1 watt        8 units 

The time required for each update calculation is chosen to be one 
microsecond. At this point, the linear program equations can be formed and 
the boundaries set on power and size for the design by the following 
steps: 

    Tc  = 1 microsecond 
    T1  = largest integer (Tc/Tp1) = largest integer (1 μsec/450 nsec) = 2 
    T2  = largest integer (Tc/Tp2) = largest integer (1 μsec/800 nsec) = 1 
    P1' = matrix elements/T1 = 6/2 = 3 
    P2' = matrix elements/T2 = 6/1 = 6 
    Watts max = P1'(WP1) = 3 · 5 = 15 
    Watts min = P2'(WP2) = 6 · 1 = 6 
    Units max = P1'(UP1) = 3 · 3 = 9 
    Units min = P2'(UP2) = 6 · 8 = 48. 

The constraint equations are now generated in the form: 

    2P1 + 1P2 ≥ 6                   time equation 
    5P1 + 1P2 ≤ X, 14 > X > 6       power equation 
    3P1 + 8P2 ≤ X, 48 > X > 9       area equation. 

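The bounding steps above amount to a few integer divisions, sketched here in Python. The dictionary layout and the function name `bounds` are assumptions made for illustration; the processor figures are those of the table of processor characteristics.

```python
import math

UPDATE_NS = 1000       # one microsecond update time
MATRIX_ELEMENTS = 6    # terms in the 2 x 3 filter matrix

PROCESSORS = {
    "P1": {"cycle_ns": 450, "watts": 5, "area": 3},
    "P2": {"cycle_ns": 800, "watts": 1, "area": 8},
}


def bounds(proc):
    """Products per update, and the processor count, power, and area
    needed if this processor type alone computed every matrix term."""
    per_update = UPDATE_NS // proc["cycle_ns"]
    count = math.ceil(MATRIX_ELEMENTS / per_update)
    return per_update, count, count * proc["watts"], count * proc["area"]


for name, proc in PROCESSORS.items():
    print(name, bounds(proc))
```

The two tuples give the coefficients of the time equation (2 and 1) and the end points between which the power and area limits may be chosen.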
With a design of a second order filter the linear conditions and limits 
are easily plotted as shown in Figure 23. Figure 23 serves as a mapping 
of the design limits of the system under the constraints chosen up to 
this point and allows a more adequate choice of the power and unit area 
constraints to further simplify the design. To illustrate this 
simplification, let the power limit be equal to 14 watts and the area limit 
equal 32 square units. Figure 24 shows the graph of this design, from which 
the optimal solution to the design can be obtained from the area above 
the time line and below both the power and unit area lines. The purpose 
of this step in the design is to check the feasibility of the constraints 
to determine if the design is possible prior to continuing the design 
sequence. Alterations in the power and area limits will increase or 
decrease the area from which the solution to the linear program can be 
located. The solution to the integer linear program is generated using 
a branch and bound technique and any reduction of the solution area will 
result in a reduction of the computation time to reach a solution. 

[Figure 23. Plot of the Equations for the Kalman Filter Circuit. Vertical 
axis: P1 type processor units; horizontal axis: P2 type processor units; 
the time line is marked.] 

[Figure 24. Integer Solutions to Kalman Filter Circuit. Vertical axis: P1 
type processor units; horizontal axis: P2 type processor units; the time 
line is marked.] 

The cost equation of the integer linear program is generated by 
letting c1 equal the cost of a processor of type P1 and c2 equal the cost 
of a type P2 processor. Let c1 = 200 and c2 = 50. The resulting cost 
equation will then be 

    200 P1 + 50 P2 = Z. 

The solution to the integer linear program may be obtained using the Land 
and Doig method as discussed in Appendix A or, in this case, a graphical 
solution is possible as seen in the plot in Figure 24. The results of 
the computer solution using the Land and Doig method are shown in Figure 
25. The two solutions as seen in Figure 24 are P1 = 2 and P2 = 2 or 
P1 = 2 and P2 = 3. Using the cost equation of the program shows that the 
solution P1 = 2 and P2 = 2 will result in the minimum cost solution to 
the program with the constraint equations applied. 
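Because the problem has only two integer variables with small upper bounds, the result can also be confirmed by exhaustive enumeration. The Python sketch below swaps in a brute-force search in place of the Land and Doig branch and bound method; it assumes the power limit of 14 watts and area limit of 32 units chosen above, and the names are illustrative.

```python
POWER_LIMIT = 14   # watts
AREA_LIMIT = 32    # square units


def feasible(p1: int, p2: int) -> bool:
    """Time, power, and area constraints of the integer program."""
    return (2 * p1 + 1 * p2 >= 6
            and 5 * p1 + 1 * p2 <= POWER_LIMIT
            and 3 * p1 + 8 * p2 <= AREA_LIMIT)


def cost(p1: int, p2: int) -> int:
    """Cost equation with c1 = 200 and c2 = 50."""
    return 200 * p1 + 50 * p2


# Upper bounds of 3 and 6 follow from the single-type processor counts.
points = [(p1, p2) for p1 in range(0, 4) for p2 in range(0, 7)
          if feasible(p1, p2)]
best = min(points, key=lambda p: cost(*p))
print(points, best, cost(*best))
```

The enumeration finds the same two feasible integer points identified graphically, with the minimum cost at P1 = 2 and P2 = 2.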
Circuit Design of Kalman Filter Problem 
The results of the linear program solution are used in conjunction 
with Chapter IV to determine the configuration of the processor array and 
bus structure. In this problem, there will be two fast processors and 
two slow processors. Since the cycle time of the fast processor is 450 
nsec, and the cycle time of the slow processor is 800 nsec, it is feasi-
ble to team one fast and one slow processor together in the interleaving 
design to compute one of the two rows of the matrix. Figure 26 shows a 
possible circuit configuration based on this design. The slow processor 
and its memory will handle one term of the matrix and vector product 
during each cycle of the machine and the fast processor will handle two 
products during approximately the same time period with the results of 
both processors accumulated in one accumulator. Figure 27 illustrates 
in more detail the interleaving circuit operation to couple the fast and 
slow processor to compute a single row of the vector, matrix product. 

    TEST PROBLEM 
    PRINT CONTROL PARAMETERS 
        1    1 
    ROWS X COLUMNS AND NO. OF INTEGER VARIABLES 
        4 X 3    2 
    UPPER BOUND ON VARIABLE 1 TO N 
        0.3000D+01   0.6000D+01 
    CONSTRAINT TYPES IN ROW ORDER 
        1   -1   -1 
    MATRIX FORMAT CODE 
        0 
    INPUT TABLEAU ECHO, CONSTRAINT VALUE LEFT, BY ROW 
        0.0          0.2000D+03   0.5000D+02 
        0.6000D+01   0.2000D+01   0.1000D+01 
        0.1400D+02   0.5000D+01   0.1000D+01 
        0.3200D+02   0.3000D+01   0.8000D+01 
    INITIAL WORKING TABLEAU 
        0             1              2 
        0.0           0.20000D+03    0.50000D+02 
        0.60000D+01  -0.20000D+01   -0.10000D+01 
       -0.14000D+02   0.50000D+01    0.10000D+01 
       -0.32000D+02   0.30000D+01    0.80000D+01 
    CONTINUOUS SOLUTION COMPLETE 
    FINAL TABLEAU FOR CONTINUOUS SOLUTION 
        0            -3             -1 
       -0.35385D+01   0.15385D+00    0.23077D+00 
       -0.43077D+01   0.23077D+00    0.28462D+01 
       -0.12308D+01  -0.76923D-01   -0.61538D+00 
    OBJECTIVE FUNCTION = 423.0769231 AT ITERATION 
    STRUCTURAL VARIABLES: X(I) 
        I =  1            2 
        0.123D+01    0.354D+01 
    OBJECTIVE FUNCTION = 500.0000000 AT ITERATION 
    STRUCTURAL VARIABLES: X(I) 
        I =  1            2 
        0.200D+01    0.200D+01 
    OPTIMALITY ESTABLISHED 
    END OF PROBLEM, ITERATION NO. 9 

Figure 25. Land and Doig Output Data 

[Figure 26. Data Flow for the Kalman Filter. For each row, a slow 
processor and a fast processor, each with its own memory of matrix terms, 
receive the new input and update data and feed a common accumulator.] 
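The timing behind this pairing can be tabulated with a few lines of Python. This is a sketch under simple assumptions: both processors start at time zero, each product completes after a whole number of cycle times, and the accumulator can accept a new term at each arrival; the function name is hypothetical.

```python
FAST_NS, SLOW_NS, UPDATE_NS = 450, 800, 1000


def completion_times(cycle_ns: int, products: int):
    """Times at which a processor delivers its successive products."""
    return [cycle_ns * (i + 1) for i in range(products)]


# One row has three terms: two from the fast unit, one from the slow unit.
arrivals = sorted(
    [(t, "fast") for t in completion_times(FAST_NS, 2)]
    + [(t, "slow") for t in completion_times(SLOW_NS, 1)]
)
gaps = [later[0] - earlier[0] for earlier, later in zip(arrivals, arrivals[1:])]
print(arrivals, gaps)
```

All three products arrive within the one microsecond update time, and the smallest spacing between arrivals indicates how quickly the shared adder must complete one accumulation.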
Summary 
This chapter has dealt with the formulation and design of a discrete 
Kalman filter using distributed architecture and circuit optimization. 
The system to be filtered was first modeled and the necessary calcula-
tions were performed to produce the Kalman filter equation and constants 
required to filter the system model. Next the distributed design was 
optimized to the design constraints and circuit block diagrams were 
generated. This sequence of steps has served to illustrate the process of 
implementing an optimal architecture design for a common class of filter-
ing problems. 
[Figure 27. Circuit for the Kalman Filter. Slow and fast multiply paths, 
each with memory, look-ahead shift network, and exponent increment/decrement 
unit, share a latch, gate, adder, and accumulator.] 
CHAPTER VII 
CONCLUSIONS AND RECOMMENDATIONS 
Conclusions 
An algorithm for the design of a special purpose distributed 
architecture computer, optimized to cost and constrained by time, power, 
and circuit size, has been described and implemented in this thesis. 
A technique of using more than one processor to compute data during 
one time interval has become known as distributed processing and the 
structure of such systems is referred to as distributed architecture. 
With the reduction in size and cost of computing devices, special consid-
eration is being given to design methods for computers that are based on 
the mathematical structure of a special class of problems. By designing 
the computer to take advantage of the structure of the problem, classes of 
problems having common characteristics can be computed efficiently in a 
special-purpose machine. For a class of problems such as recursive linear 
filters, vehicle navigation, and sonar receiving, the common mathematical 
structure is the vector, matrix multiplication algorithm. The structure 
of the vector, matrix problem has led to a distributed architecture known 
as array processing, and specific criteria defining this class of data 
processing have been generated. 
The design of array processing machines in the past has been based on the 
maximum throughput of data technically possible at the time of design. An 
alternate concept has been presented in this thesis based on the 
optimization of the computing system with respect to cost and constrained 
by time, 
power, and circuit size. One of the fundamental questions in design is 
whether or not the system is capable of meeting the specifications of the 
design for a particular application. Optimization methods have been shown 
helpful in reducing the design of an array processor to obtain the best 
trade-off of speed, power, and circuit size with special emphasis placed 
on overall cost. From the concept of optimizing an array processor has 
come a multi-phase processor design that utilizes different speed 
processors in a common circuit to better meet the overall design 
requirements. The basic criteria for the design have been generated from 
the definition of array processing and these rules of design can be 
followed through each phase of the design process as shown in Chapters IV 
and V. 
Finally, the design algorithm has been put to use in Chapter VI to 
produce a multi-phased array processor to implement a Kalman filter. This 
same algorithm can be employed to generate a circuit structure for any 
similar structured problem that requires vector, matrix multiplications. 
It is felt that the design of computers should, in certain classes 
of problems, be based on the problem structure or problem algorithm as 
well as the hardware present to construct the system. The entire system 
should be optimized with available optimization techniques to reduce the 
hardware structure to the best possible state. 
Recommendations for Further Research 
Within the design framework established by this thesis, several 
additional areas of research arise for the application of the proposed 
optimization design algorithm. The problems are for the most part concerned 
with extending the design to take into account all the parameters in the 
problem algorithm, the architecture structure, and the hardware 
components. 
Various techniques for optimization and component reduction are present 
in the literature and with some modification these tools can be employed 
to aid in future system designs. The problems of cost, speed, 
power and size are being studied at the chip design level but little has 
been done to meet these problems at the system level. System design with 
the aid of computers is a fast growing area and large programs of an 
interactive nature are needed to speed the design of systems and take into 
account all the aspects of the problem at one time. There presently 
exist several software packages to do optimization problems as well as 
software to do some of the other steps in total system design and 
simulation. These software packages can be integrated together to form one 
design package capable of assisting a designer in producing a practical 
circuit to solve a problem. 
The further changes in hardware can continue to affect the design 
considerations and processes. Floating point units are all but present 
today and their production will cause changes to the design of array 
processor systems as well as all other types of computing circuits. The 
interconnection of these future special-purpose chips will be of concern 
to design engineers in the future and better ways of using and keeping 
up with the new hardware must be found. 
As the computer area and digital design area mature, more complex 
mathematical techniques for the processing of data in real time 
applications will come into use. These methods will enhance the data 
processing capability of the future provided adequate means can be provided 
to design and implement computers to perform these tasks. The primary area 
of concern for the design of these systems will still be speed, power, 
size, and cost of the system. Regardless of the new algorithms made 
available, the new chips produced, and the new problems to be solved, the 
constraints of cost, speed, power, and size are forever present. 
SELECTED BIBLIOGRAPHY 
1. Unger, S. H. "A Computer Oriented Towards Spatial Problems." IRE 
Proceedings, Vol. 36, October, 1958, pp. 1744-1750. 
2. Holland, J. H. "A Universal Computer Capable of Executing an 
Arbitrary Number of Sub-Programs Simultaneously." Proceedings of 
the Eastern Joint Computer Conference, 1959, pp. 108-113. 
3. Comfort, W. T. "A Modified Holland Machine." Proceedings, Fall 
Joint Computer Conference, 1963, pp. 481-488. 
4. Slotnick, D. L. "The SOLOMON Computer." Proceedings, Fall Joint 
Computer Conference, 1962, pp. 97-107. 
5. Gonzales, R. A. "A Multi-Layer Iterative Circuit Computer." IEEE 
Transactions on Electronic Computers, Vol. 12, December, 1963, 
pp. 781-790. 
6. Barnes, G. H. "The ILLIAC IV Computer." IEEE Transactions on 
Computers, Vol. 17, August, 1968, pp. 746-757. 
7. Pariser, J. J. and Maurer, H. E. "Implementation of the NASA Modular 
Computer with LSI Functional Characters." AFIPS Conference 
Proceedings, Fall Joint Computer Conference, 1969, pp. 231-245. 
8. Dere, W. Y. and Sakrison, D. J. "The Berkeley Array Processor." IEEE 
Transactions on Computers, Vol. C-19, No. 5, May, 1970, pp. 444-447. 
9. Cannon, L. E. "A Cellular Computer to Implement the Kalman Filter 
Algorithm." Ph.D. Thesis, Montana State University, August, 
1969. 
10. Weissberger, A. "Analysis of Multiple-Microprocessor System Architec-
ture." IEEE Transactions on Computers, Vol. C-26, No. 5, June, 
1977' pp. 151-163. 
11. Bowra, J. W. and Torng, H. C. "The ~1odeling and Design of r~ultiple 
Function-Units Processors. 11 IEEE Transactions on Computers, Vol. 
C-23, No. 4, March, 1976, pp. 210-221. 
12. Flynn, f>'l. J. 11 Some Computer Organizations and Their Effectiveness." 
IEEE Transactions on Computers, Vol. C-21, No. 9, September, 
1972, pp. 948-960. 
13. Enslow, P. H. Multiprocessors and Parallel Processing. New York: 
Wiley-Interscience, 1974. 
14. Stone, A. L. "Parallel Computers." Introduction to Computer Archi-
tecture, SRA, 1975, pp. 318-374. 
15. Lipovski, G. J. and Doty, K. L. "Developments and Directions in 
Computer Architecture." Computer, August, 1978, pp. 54-67. 
16. Tang, C. K. "Cache System Design in the Tightly Coupled Multipro-
cessor System." Proceedings of the National Computer Confer-
ence, 1976, pp. 749-753. 
17. Hayes, J. P. Computer Architecture and Organization. New York: 
McGraw-Hill Book Company, 1978. 
18. Aho, A. V., Hopcroft, J. E. and Ullman, J. D. The Design and Analy-
sis of Computer Algorithms. Mass.: Addison-Wesley, 1974. 
19. Stratonovich, R. L. "Conditional Markov Processes." Theory of Prob-
ability and its Applications, Vol. 5, No. 2, 1960, pp. 156-178. 
20. Storenson, H. W. and Stubberud, A. R. Linear Estimation Theory. 
National Technical Information Service, U. S. Department of 
Commerce, 1970, pp. 3-41. 
21. Fenwick, P. M. "Binary Multiplication with Overlapped Addition 
Cycles." IEEE Transactions on Computers, Vol. C-21, No. 6, 
January, 1969, pp. 71-74. 
22. McDonald, T. G. and Guha, R. K. "The Two's Complement Quasi-Serial 
Multiplier." IEEE Transactions on Computers, Vol. C-22, No. 4, 
December, 1975, pp. 1233-1235. 
23. Waser, S. and Peterson, A. "Real-Time Processing Gains Ground with 
Fast Digital Multipliers." Electronics, September, 1977, pp. 
93-99. 
24. Wallace, C. S. "A Suggestion for a Fast Multiplier." IEEE Trans-
actions on Electronic Computers, February, 1964, pp. 14-17. 
25. MacSorley, O. L. "High Speed Arithmetic in Binary Computers." Pro-
ceedings of the IRE, January, 1961, pp. 67-91. 
26. Parasuraman, B. "Hardware Multiplication Techniques for Micropro-
cessor Systems." Computer Design, April, 1977, pp. 75-82. 
27. Geist, D. J. "MOS Processors Pick-up Speed with Bipolar Multipli-
ers." Electronics, July, 1977, pp. 113-115. 
28. Pritchard, R. L. Trends in Integrated Electronics and Microprocessor 
Technology. General Electric Report No. 77CRD070, May, 1977. 
29. Torng, H. C. and Wilhelm, N. C. "The Optimal Interconnection of Cir-
cuit Modules in Microprocessor and Digital System Design." IEEE 
Transactions on Computers, Vol. C-26, No. 5, May, 1977. 
APPENDIXES 
APPENDIX A 
MIXED INTEGER LINEAR PROGRAM 
Purpose 
This program finds the minimum of a multivariable, linear function 
subject to linear constraints, in which some or all of the variables may 
be restricted to integer values: 
\[
\text{Minimize}\quad F = c_1 x_1 + c_2 x_2 + \cdots + c_{N1} x_{N1} + c_{N1+1} y_{N1+1} + \cdots + c_N y_N
\]
\[
\text{Subject to}\quad \sum_{j=1}^{N1} A_{ij} x_j + \sum_{k=N1+1}^{N} A_{ik} y_k \;\{\le,\, =,\, \ge\}\; B_i, \qquad i = 1, \ldots, m
\]
where the $x_j$ are each integer and subject to an upper bound, and 
\[
x_j,\; y_k \ge 0 .
\]
Method 
The algorithm is based on the Land and Doig method. A dual simplex 
algorithm is embedded in the program to obtain the starting, continuous 
solution and to evaluate each integer trial. The specified integer variables 
are tested one at a time in paired values to establish direction and value. 
The algorithm is as follows: 
1. The algorithm employs a dual simplex linear programming algorithm 
(not product form) hereinafter referred to as the LP. The tableau 
is carried in compact Tucker form: the initial number of rows 
equals the number of problem constraints plus one; the initial 
number of columns equals the number of true variables plus one. 
Whenever a zero-constrained slack variable becomes non-basic, it 
is removed from the problem, resulting in a reduction by one of 
the number of columns in the tableau. Zero-constrained slack 
variables arise from two sources: equality constraints in the 
initial tableau; constraining a basic integer variable to an 
integer value (see 4 below). The number of rows in the tableau 
remains constant throughout. 
2. Carry out an LP on the initial tableau. Print the solution. 
Check to see if all integer variables are integer valued. If so, 
the problem is terminated; if not, set the initial tolerance for 
the problem. (Tolerance is defined as the value below which the 
objective function must stay in order for a continuation of the 
current sequence of integer-constrained integer variables to be 
considered as a candidate for the mixed integer solution. Note 
that the objective function value at the continuous solution 
represents an absolute lower bound for the mixed integer solu-
tion.) Set to 1 the index of the integer variable being con-
strained. 
3. Choose from those integer variables which are non-basic in the 
current tableau the one with highest coefficient in the objective 
function (shadow price). (The program makes use of the fact that 
the shadow price represents an underestimate of the increase in 
the objective function associated with constraining the non-basic 
integer variable to 1.) If no non-basic integer variable exists, 
go to 4. Otherwise, store the current tableau and constrain the 
variable chosen to zero. This is done simply by removing the 
corresponding column from the tableau. (A non-basic variable is 
constrained to a non-zero integer value by adding the product of 
this value with each element in the corresponding column to the 
constant column of the tableau. The corresponding column is then 
removed from the tableau.) Go to 6. 
4. Store the current tableau. Consider all integer variables $X_i$ 
which are basic in the current tableau (there must be at least 
one) with value $X_i^f$. For each $X_i$ determine the absolute differ-
ence between the increase in the objective function associated 
with the initial LP pivot step when $X_i$ is constrained to $[X_i^f]$ 
and when $X_i$ is constrained to $[X_i^f] + 1$. Choose as the integer 
variable to be constrained that $X_i$ for which this difference is 
a maximum and constrain it to the value yielding the smaller 
increase. The actual constraining is accomplished by adding the 
integer value to the constant column of the row corresponding to 
the variable, and then stipulating that the row corresponds to a 
zero-constrained slack variable. Carry out an LP. If the objec-
tive function stays within the tolerance go to 6; otherwise go to 
5. 
5. If the current integer variable was constrained to $[X_i^f]$, record 
the fact that constraining it to values $[X_i^f] - k$ (k = 1, 2, ...) 
within its range need not be considered. Conversely, if $X_i$ was 
set to $[X_i^f] + 1$, make note that values $[X_i^f] + 1 + k$ need not be 
considered. Go to 9. 
6. Test the constrained variable index. If it is equal to N1, the 
number of integer variables in the problem, go to 9. Otherwise 
increase it by one and go to 3. 
7. Decrease the constrained variable index by one and test it. 
8. If it is zero go to 11. Otherwise go to 9. 
9. Determine for the integer variable corresponding to the current 
value of the index whether its range has been exhausted (explic-
itly or implicitly) on neither, on one or on both sides of its 
current value. If it has been exhausted on both sides, go to 7. 
If the variable to be constrained has been exhausted on one side, 
constrain it to the unexhausted integer value closest to its cur-
rent value in the proper direction. If the range is unexhausted 
on either side, determine in which direction to go using the 
method employed in 4, and proceed as for only that side open. 
(Note that the range of an integer variable which was non-basic 
when constrained is immediately exhausted from below.) Carry 
out an LP. If the objective function stays within the tolerance 
go to 6. Otherwise, note that the range of the current variable 
is exhausted in the direction in which its current value lies 
from its original value (see 5). Go to 9. 
10. A better feasible mixed integer solution has been obtained. Print 
the solution. Replace the tolerance by the objective function 
value. Go to 8. 
11. For the current tolerance, all ranges of all the integer vari-
ables have been exhausted. If at least one feasible mixed inte-
ger solution has been obtained, the last printed solution is an 
optimal solution to the mixed integer problem and the problem is 
terminated. Otherwise, the tolerance is increased, the con-
tinuous solution tableau is restored, the index of constrained 
integer variables is set to one, and control goes to 3. 
If the program is terminated abnormally, the last printed feasible 
mixed integer solution (if any) is the best obtained. A flow diagram 
illustrating the above procedure is shown in Figure 28. 
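The search pattern of the procedure above (constrain one integer variable at a time, abandon a sequence as soon as it cannot stay within the tolerance, and tighten the tolerance whenever a better feasible solution is found) can be illustrated with a short Python sketch. This is not the program described in this appendix: the function name `branch_and_bound` and its arguments are hypothetical, only pure-integer problems with "less than or equal" constraints are handled, and the dual simplex bound is replaced by a much cruder bound that ignores the constraints.

```python
def branch_and_bound(c, A, b, upper):
    """Minimize sum(c[j]*x[j]) subject to A x <= b, 0 <= x[j] <= upper[j], x integer."""
    n = len(c)
    # best["value"] plays the role of the tolerance: the objective value that
    # any further candidate sequence must beat.
    best = {"value": float("inf"), "x": None}

    def lower_bound(j, cost):
        # Cheapest conceivable completion of the still-free variables j..n-1,
        # ignoring the constraints (a cruder bound than an LP relaxation).
        return cost + sum(min(0, c[k] * upper[k]) for k in range(j, n))

    def feasible(x):
        return all(sum(a * v for a, v in zip(row, x)) <= rhs
                   for row, rhs in zip(A, b))

    def search(j, x, cost):
        if lower_bound(j, cost) >= best["value"]:
            return                      # cannot improve on the tolerance: backtrack
        if j == n:                      # every integer variable is constrained
            if feasible(x):
                best["value"], best["x"] = cost, list(x)
            return
        for v in range(upper[j] + 1):   # constrain variable j to each value in turn
            search(j + 1, x + [v], cost + c[j] * v)

    search(0, [], 0)
    return best["value"], best["x"]


# Example: minimize F = -3*x1 - 2*x2 with x1 + x2 <= 4, x1 + 3*x2 <= 6, x1, x2 <= 3.
value, x = branch_and_bound([-3, -2], [[1, 1], [1, 3]], [4, 6], [3, 3])
print(value, x)
```

The sketch enumerates values in increasing order rather than choosing the branching direction from the LP pivot step, so it only shows the role of the tolerance, not the variable-selection rules of steps 3 and 4.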
Program Description 
1. Usage: 
The program consists of a main program only. Program size, sol-
ution estimate, and tableau coefficients along with control 
parameters are read in. The objective function to be minimized 
is the first row of the tableau. 
2. Subroutines Required: 
None. 
3. Description of Parameters: 
ISIZE   Intermediate storage area = NZR1VR*(2*N-NZR1VR+1)/2 or 
        as large as possible. 
NMRUNS  Number of runs or problems to be solved. 
IOUT2   Print control for initial working tableau: 
          0 = No print 
          1 = Print tableau. 
IOUT3   Print control for continuous solution tableau: 
          0 = No print 
          1 = Print tableau. 
IPACK   Matrix format: 
          0 = Unpacked, read all coefficients 
          1 = Packed, read non-zero coefficients only. 
SOLMIN  Estimate of objective function if known, zero otherwise. 
PCTTOL  Tolerance as fraction of objective function for contin-
        uous solution (may be left at zero). 
M       Total number of rows. 
N       Total number of columns; equals sum of X and Y variables 
        plus 1 for constraints. 
NM1     DO loop parameter: NM1 = N - 1. 
NZR1VR  Number of integer variables. 
UPBND   Vector of integer variables' upper bounds; size = N - 1. 
IROW    Vector of constraint types; size = M - 1: 
          +1      b_i 
           0   =  b_i 
          -1      b_i 
ITEMP   Column of coefficients being read in row i including 
        objective row. 
VAL     Coefficient value of columns specified by ITEMP for 
        row i. 
ATAB    Initial working tableau, N x M array. 
NI      Card reader unit number. 
NO      Printer unit number. 
[Figure 28. Mixed Integer Linear Programming Logic Diagram (flow chart 
of steps 1 through 11 above)] 
4. DIMENSION Requirements: 
The COMMON and DOUBLE PRECISION statements in the main program 
should be modified according to the requirements of the largest 
problem in the set being run. The parameters included in the 
following statements conform to the Input Parameter definitions 
above: 
COMMON IROW(M), ITBROW(M), ICOL(N), ITBCOL(N), IVAR(N), 
ISVROW(M,NZR1VR), ISVRCL(NZR1VR), ICORR(NZR1VR), ISVN(NZR1VR), 
KSVN(NZR1VR+1) 
DOUBLE PRECISION ATAB(M+1,N), UPBND(N+1), 
TPVAL(NZR1VR+1), BTMVL(NZR1VR+1), VAL(NZR1VR+1), TBSAV(M,N), 
SAVTAB(M+1,NZR1VR*(2*N-NZR1VR+1)/2), T(N). 
5. Input Formats: 
CARD TYPE   FORMAT          CONTENTS 
1           (20I4)          ISIZE, NMRUNS 
                            (Appears only once per program execution.) 
2           (55H )          Problem title, identification 
                            (Put 1 in card column 1 for printer page control.) 
3           (20I4)          IOUT2, IOUT3, IPACK 
4           (7E10.0)        SOLMIN, PCTTOL 
5           (20I4)          M, N, NZR1VR 
6           (7E10.0)        (UPBND(I), I=1, NM1) 
                            (If NZR1VR exceeds 7, additional CARD TYPE 6's 
                            required.) 
7           (20I4)          (IROW(I), I=2,M) 
                            (If M exceeds 20, additional TYPE 7's required.) 
8           If IPACK = 1 
            (7(I3, E7.0))   (ITEMP(K), VAL(K), K=1, 7) 
                            (If more than 7 non-zero coefficients exist, addi-
                            tional TYPE 8's required. Last TYPE 8 card must 
                            end with zero field. If last card full, insert 
                            blank card.) 
9           If IPACK = 0 
            (7E10.0)        (ATAB(I,J), J=1, N) 
                            (One TYPE 9 per row including objective fct. If 
                            N exceeds 7, additional TYPE 9's per row required.) 
6. Output: 
The main program prints out the problem title supplied, print 
control parameters, problem size and number of integer variables, 
bounds on the integer variables, codes for the constraint types, 
and the matrix format type code as part of the initial data. 
The coefficient tableau is printed as raw data for checking 
purposes. 
If IOUT2 = 1, the initial working tableau (as input to the 
first dual simplex solution) is printed in the Tucker form as 
used. 
If IOUT3 = 1, the tableau from the continuous solution is 
printed. 
The objective function value and values of each variable are 
printed for the continuous solution and for each feasible integer 
solution along with the present iteration number. 
Error messages are printed for abnormal terminations sugges-
ting the reason and giving the iteration number. 
7. Summary of User Requirements: 
a) Determine values for each problem set for SOLMIN, PCTTOL, M, 
N, NZR1VR, UPBND, IROW, NMRUNS, NI, and NO. 
b) Calculate intermediate storage area for ISIZE. 
c) Define code for matrix type for each problem. 
d) Specify print control criteria for IOUT2, IOUT3. 
e) Adjust COMMON size statements as needed to hold largest prob-
lem or satisfy machine limits. 
f) Adjust FORMAT statements as necessary. 
The FORTRAN program contained in this section is based on Branch and 
Bound Mixed Integer Programming, described on page 242 of "Catalog of Pro-
grams for IBM System 360 Models 25 and Above," GC 20-1619-8, program num-
ber 360D-15.2.005. Used by permission of International Business Machines 
Corporation. 
APPENDIX B 
LINEAR EQUATION BOUNDARY PLOT PROGRAM 
As discussed in Chapter 4, the following program was used to plot 
and study the design parameters for a multi-phase array processor circuit. 
The intent of the plotted data is to show the area in which the solution 
of the linear integer program is contained. This is done in such a way 
that alterations to the design can be introduced with ease. The program 
is written in Fortran and was executed on the IBM 370/158. 
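The arithmetic behind the three plotted boundaries is easier to follow in a compact form. The Python sketch below mirrors the first pass of the Fortran main program that follows, using the parameter values set in its data section; the function name `boundary_table` and the tuple layout are illustrative choices, not part of the original program. Each row gives a count P1 of the first processor type and the count P2 of the second type allowed by the time, power, and area equations respectively.

```python
def boundary_table(m=15, n=20, speed1=500, speed2=1000, power1=45, power2=15,
                   units1=2, units2=6, tc=1000, points=51):
    """Return rows (P1, P2 on time line, P2 on power line, P2 on area line)."""
    inter = tc // speed1           # operations per cycle time, processor type 1
    size = m * n                   # number of multiplications in the product
    delay = speed2 // speed1       # relative slowness of processor type 2
    inter2 = inter // delay        # operations per cycle time, processor type 2
    limit1 = size // inter         # type-1 processors needed if used alone
    watts = float(limit1 * power1)   # right-hand side of the power equation
    area = float(limit1 * units1)    # right-hand side of the area equation
    step = limit1 / 51.0           # same divisor as the Fortran POINT value
    rows = []
    for j in range(points):
        p1 = j * step
        time_p2 = (size - p1 * inter) / inter2
        power_p2 = max(0.0, (watts - p1 * power1) / power2)  # clipped at zero
        area_p2 = max(0.0, (area - p1 * units1) / units2)    # clipped at zero
        rows.append((p1, time_p2, power_p2, area_p2))
    return rows


for row in boundary_table()[:3]:
    print("%10.3f%10.3f%10.3f%10.3f" % row)
```

The Fortran program repeats this calculation four times, lowering the power limit and raising the area limit on each pass; the sketch covers only the first pass.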
$JOB  TIME=10
      DIMENSION DATA1(153),DATA2(51),ISYMB(3)
      INTEGER M,N,SPEED1,SPEED2,POWER1,POWER2,UNITS1,UNITS2,TC
      INTEGER SIZE,DELAY,INTER,INTER2,LIMIT1,LIMIT2,POW1,POW2,UN1,UN2
C*******************************************************
C     DATA SECTION FOR PARAMETERS OF LINEAR EQUATIONS
C
      DATA ISYMB/'T','P','A'/
      ZERO = 0.0
      M = 15
      N = 20
      SPEED1 = 500
      SPEED2 = 1000
      POWER1 = 45
      POWER2 = 15
      UNITS1 = 2
      UNITS2 = 6
      TC = 1000
      INTER = TC / SPEED1
      SIZE = M * N
      DELAY = SPEED2 / SPEED1
      INTER2 = INTER / DELAY
      LIMIT1 = SIZE / INTER
      LIMIT2 = SIZE / INTER2
      POW1 = LIMIT1 * POWER1
      POW2 = LIMIT2 * POWER2
      UN1 = LIMIT1 * UNITS1
      UN2 = LIMIT2 * UNITS2
      ADDON = FLOAT(IABS(POW1 - POW2)/4)
      ADDUM = FLOAT(IABS(UN1 - UN2)/4)
      WATTS = FLOAT(POW1)
      AREA = FLOAT(UN1)
C
C     CALCULATE THE DATA POINTS FOR THE EQUATIONS
C
      DO 60 K=1,4
      POINT = FLOAT(LIMIT1)/51.0
      POINT2 = 0.0
      DO 20 J=1,51
      DATA2(J) = POINT2
      TP2 = (FLOAT(SIZE) - DATA2(J) * FLOAT(INTER))/FLOAT(INTER2)
      DATA1(J) = TP2
      WP2 = (WATTS - DATA2(J) * FLOAT(POWER1))/FLOAT(POWER2)
      IF(WP2.LT.ZERO)WP2 = 0.0
      DATA1(51+J) = WP2
      UP2 = (AREA - DATA2(J) * FLOAT(UNITS1))/FLOAT(UNITS2)
      IF(UP2.LT.ZERO)UP2 = 0.0
      DATA1(102+J) = UP2
      POINT2 = POINT2 + POINT
   20 CONTINUE
      WRITE(6,21)
   21 FORMAT(1H1,3X,6HX-AXES,4X,6HY-TIME,4X,7HY-POWER,3X,6HY-AREA,/)
C
C     PRINT THE DATA POINTS OF X AND Y
C
      DO 30 I=1,51
      WRITE(6,25)DATA2(I),DATA1(I),DATA1(I+51),DATA1(I+102)
   30 CONTINUE
   25 FORMAT(4F10.3)
      WRITE(6,1)INTER,INTER2,SIZE
      WRITE(6,2)POWER1,POWER2,WATTS
      WRITE(6,3)UNITS1,UNITS2,AREA
    1 FORMAT(1H1,10X,5HTIME(,4X,I3,2X,2HP1,2X,1H+,2X,I3,2X,2HP2,2X,1H=,2
     1X,I6)
    2 FORMAT(1H ,10X,6HPOWER(,3X,I3,2X,2HP1,2X,1H+,2X,I3,2X,2HP2,2X,1H=,
     1F10.3)
    3 FORMAT(1H ,10X,5HAREA(,4X,I3,2X,2HP1,2X,1H+,2X,I3,2X,2HP2,2X,1H=,2
     1X,F10.3,//)
      CALL YPLOT(51,51,51,DATA2,DATA1,3,ISYMB,6)
      WATTS = WATTS - ADDON
      AREA = AREA + ADDUM
   60 CONTINUE
      WRITE(6,500)
  500 FORMAT(1H1)
      STOP
      END
      SUBROUTINE YPLOT(IX,IY,NPNTS,X,Y,NCRVS,ISYMB,IOUT)
C
C     THIS IS A Y=F(X) PLOT ROUTINE THAT FEATURES:
C
C       1. VARIABLE HEIGHT AND WIDTH OF PLOTS
C       2. AUTOMATIC SCALING
C       3. HORIZONTAL X AXIS AND VERTICAL Y AXIS
C       4. MULTIPLE CURVES IN A SINGLE PLOT
C
C     INPUT PARAMETERS
C
C       IX     NUMBER OF COLUMNS IN PLOT
C       IY     NUMBER OF ROWS IN PLOT
C       NPNTS  NUMBER OF DATA POINTS PER SET
C       X      VECTOR OF X VALUES
C       Y      VECTOR OF Y VALUES
C       NCRVS  NUMBER OF CURVES TO BE PLOTTED
C       ISYMB  VECTOR OF ALPHANUMERIC SYMBOLS TO BE USED IN PLOTS
C       IOUT   LOGICAL UNIT NUMBER FOR OUTPUT DEVICE
C
      DIMENSION IGRPH(100,100),XSCAL(100),YSCAL(100),ISYMB(1),
     *          X(NPNTS),Y(1)
      DATA IBLNK,IPLUS,MINUS/' ','+','-'/
C
C     INITIALIZE ARRAY TO BLANK
C
      DO 150 I=1,IX
      DO 100 J=1,IY
      IGRPH(I,J)=IBLNK
  100 CONTINUE
  150 CONTINUE
C
C     DETERMINE MINIMUM AND MAXIMUM X AND Y VALUES
C
      XMAX=X(1)
      XMIN=X(1)
      YMAX=Y(1)
      YMIN=Y(1)
      DO 200 I=2,NPNTS
      IF(X(I).GT.XMAX) XMAX=X(I)
      IF(X(I).LT.XMIN) XMIN=X(I)
  200 CONTINUE
      NYPTS=NCRVS*NPNTS
      DO 300 I=1,NYPTS
      IF(Y(I).GT.YMAX) YMAX=Y(I)
      IF(Y(I).LT.YMIN) YMIN=Y(I)
  300 CONTINUE
C
C     TEST FOR FLAT LINE
C
      IF(YMAX.NE.YMIN) GO TO 250
      YHALF=YMAX/2.0
      IF(YHALF.EQ.0.0) YHALF=1.0
      YMAX=YMAX+YHALF
      YMIN=YMIN-YHALF
  250 CONTINUE
C
C     RECORD PLOT DATA IN ARRAY
C
      DO 2000 J=1,NCRVS
      JCRV=NCRVS-J+1
      DO 1000 I=1,NPNTS
      ISUBX=IFIX((X(I)-XMIN)/(XMAX-XMIN)*FLOAT(IX-1)+.499)+1
      II=I+(JCRV-1)*NPNTS
      ISUBY=IFIX((Y(II)-YMIN)/(YMAX-YMIN)*FLOAT(IY-1)+.499)+1
      IGRPH(ISUBX,ISUBY)=ISYMB(JCRV)
 1000 CONTINUE
 2000 CONTINUE
C
C     COMPUTE SCALED VALUES FOR X AND Y
C
      DO 3000 I=1,IX
      XSCAL(I)=FLOAT(I-1)/FLOAT(IX-1)*(XMAX-XMIN)+XMIN
 3000 CONTINUE
      DO 3050 I=1,IY
      YSCAL(I)=FLOAT(I-1)/FLOAT(IY-1)*(YMAX-YMIN)+YMIN
 3050 CONTINUE
C
C     PRINT OR DISPLAY ARRAY OF PLOTTED DATA
C
      DO 4000 I=1,IY
      LINE=IY-I+1
      K=LINE-1
      IF(K/5*5.EQ.K)WRITE(IOUT,2)YSCAL(LINE),(IGRPH(J,LINE),J=1,IX)
      IF(K/5*5.NE.K)WRITE(IOUT,1)(IGRPH(J,LINE),J=1,IX)
 4000 CONTINUE
      DO 5000 I=1,IX
      IGRPH(I,1)=MINUS
      IF((I-1)/10*10.EQ.I-1) IGRPH(I,1)=IPLUS
 5000 CONTINUE
      WRITE(IOUT,3) (IGRPH(J,1),J=1,IX)
      WRITE(IOUT,4) (XSCAL(J),J=1,IX,10)
      RETURN
    1 FORMAT(18X,'I',100A1)
    2 FORMAT(8X,E10.4,'+',100A1)
    3 FORMAT(19X,100A1)
    4 FORMAT(16X,10(F6.1,4X))
      END
C$IBSYS
$ENTRY
   X-AXES    Y-TIME    Y-POWER   Y-AREA
     0.000   300.000   450.000    50.000
     2.941   294.117   441.176    49.020
     5.882   288.235   432.353    48.039
     8.824   282.353   423.529    47.059
    11.765   276.470   414.706    46.078
    14.706   270.588   405.882    45.098
    17.647   264.706   397.059    44.118
    20.588   258.823   388.235    43.137
    23.529   252.941   379.412    42.157
    26.471   247.059   370.588    41.176
    29.412   241.177   361.765    40.196
    32.353   235.294   352.941    39.216
    35.294   229.412   344.118    38.235
    38.235   223.530   335.294    37.255
    41.176   217.647   326.471    36.275
    44.118   211.765   317.647    35.294
    47.059   205.883   308.824    34.314
    50.000   200.000   300.000    33.333
    52.941   194.118   291.177    32.353
    55.882   188.236   282.353    31.373
    58.823   182.353   273.530    30.392
    61.764   176.471   264.707    29.412
    64.706   170.589   255.883    28.431
    67.647   164.706   247.060    27.451
    70.588   158.824   238.236    26.471
    73.529   152.942   229.413    25.490
    76.470   147.059   220.589    24.510
    79.411   141.177   211.766    23.530
    82.353   135.295   202.942    22.549
    85.294   129.412   194.119    21.569
    88.235   123.530   185.295    20.588
    91.176   117.648   176.472    19.608
    94.117   111.765   167.648    18.628
    97.058   105.883   158.825    17.647
   100.000   100.001   150.001    16.667
   102.941    94.118   141.178    15.686
   105.882    88.236   132.354    14.706
   108.823    82.354   123.531    13.726
   111.764    76.472   114.707    12.745
   114.705    70.589   105.884    11.765
   117.647    64.707    97.060    10.784
   120.588    58.825    88.237     9.804
   123.529    52.942    79.413     8.824
   126.470    47.060    70.590     7.843
   129.411    41.178    61.766     6.863
   132.352    35.295    52.943     5.883
   135.294    29.413    44.120     4.902
   138.235    23.531    35.296     3.922
   141.176    17.648    26.473     2.941
   144.117    11.766    17.649     1.961
   147.058     5.884     8.826     0.981
          TIME(      2  P1  +    1  P2  =     300
          POWER(    45  P1  +   15  P2  =  6750.000
          AREA(      2  P1  +    6  P2  =   300.000
[Line-printer plot of the three boundary lines against P1: T marks the 
time boundary, P the power boundary, and A the area boundary. Horizontal 
axis 0.0 to 147.1; vertical axis 0.9806E 00 to 0.4500E 03.] 
   X-AXES    Y-TIME    Y-POWER   Y-AREA
     0.000   300.000   412.533   112.500
     2.941   294.117   403.709   111.520
     5.882   288.235   394.886   110.539
     8.824   282.353   386.063   109.559
    11.765   276.470   377.239   108.578
    14.706   270.588   368.416   107.598
    17.647   264.706   359.592   106.618
    20.588   258.823   350.768   105.637
    23.529   252.941   341.945   104.657
    26.471   247.059   333.122   103.676
    29.412   241.177   324.298   102.696
    32.353   235.294   315.475   101.716
    35.294   229.412   306.651   100.735
    38.235   223.530   297.828    99.755
    41.176   217.647   289.004    98.775
    44.118   211.765   280.181    97.794
    47.059   205.883   271.357    96.814
    50.000   200.000   262.534    95.833
    52.941   194.118   253.710    94.853
    55.882   188.236   244.887    93.873
    58.823   182.353   236.063    92.892
    61.764   176.471   227.240    91.912
    64.706   170.589   218.416    90.931
    67.647   164.706   209.593    89.951
    70.588   158.824   200.769    88.971
    73.529   152.942   191.946    87.990
    76.470   147.059   183.122    87.010
    79.411   141.177   174.299    86.029
    82.353   135.295   165.475    85.049
    85.294   129.412   156.652    84.069
    88.235   123.530   147.829    83.088
    91.176   117.648   139.005    82.108
    94.117   111.765   130.182    81.128
    97.058   105.883   121.358    80.147
   100.000   100.001   112.535    79.167
   102.941    94.118   103.711    78.186
   105.882    88.236    94.888    77.206
   108.823    82.354    86.064    76.226
   111.764    76.472    77.241    75.245
   114.705    70.589    68.417    74.265
   117.647    64.707    59.594    73.284
   120.588    58.825    50.770    72.304
   123.529    52.942    41.947    71.324
   126.470    47.060    33.123    70.343
   129.411    41.178    24.300    69.363
   132.352    35.295    15.476    68.383
   135.294    29.413     6.653    67.402
   138.235    23.531     0.000    66.422
   141.176    17.648     0.000    65.441
   144.117    11.766     0.000    64.461
   147.058     5.884     0.000    63.481
          TIME(      2  P1  +    1  P2  =     300
          POWER(    45  P1  +   15  P2  =  6188.000
          AREA(      2  P1  +    6  P2  =   675.000
[Line-printer plot of the three boundary lines against P1: T marks the 
time boundary, P the power boundary, and A the area boundary. Horizontal 
axis 0.0 to 147.1; vertical axis 0.0000E 00 to 0.4125E 03.] 
APPENDIX C 
KALMAN FILTER GAIN FINDING PROGRAM 
This program simulates the Kalman filter designed in Chapter 6 and 
is used to obtain the gain values at each update point. These gain 
values will be utilized in the memory of the multi-phase array processor 
in on-line operation. 
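Because the gains depend only on the model and noise parameters, not on the measurements, the gain sequence can be reproduced without the measurement data. The Python sketch below follows the recursion of the Fortran listing in this appendix (transition matrix, closed-form Q matrix of Appendix D, scalar measurement of the first state, initial covariance equal to the identity); the function name `kalman_gains` and the nested-list matrix handling are illustrative choices, and the coefficient named B here is the one the Fortran program stores in the variable D.

```python
import math


def kalman_gains(b=30.0, K=27.0, T=0.001, q=1.0, r=1.0, steps=50):
    """Return the list of gain pairs (KJ(1), KJ(2)) for each update point."""
    A = b / (b - 2.0 * K)
    B = b / (2.0 * K - b)
    L = 1.0 - math.exp(-4.0 * K * T)
    J = 1.0 - math.exp(-(2.0 * K + b) * T)
    Y = 1.0 - math.exp(-2.0 * b * T)
    # Process-noise covariance Q (closed form derived in Appendix D).
    q11 = q * (K * A**2 * L + 8.0 * K**2 * A * B * J / (2.0 * K + b)
               + 2.0 * K**2 * B**2 * Y / b)
    q12 = q * (K * A * L + 4.0 * K**2 * B * J / (2.0 * K + b))
    q22 = q * K * L
    Q = [[q11, q12], [q12, q22]]
    # State transition matrix over one sample interval T.
    F = [[math.exp(-b * T), A * math.exp(-2.0 * K * T) + B * math.exp(-b * T)],
         [0.0, math.exp(-2.0 * K * T)]]
    P = [[1.0, 0.0], [0.0, 1.0]]          # initial error covariance
    gains = []
    for _ in range(steps):
        # Predicted covariance M = F P F' + Q.
        FP = [[sum(F[i][m] * P[m][j] for m in range(2)) for j in range(2)]
              for i in range(2)]
        M = [[sum(FP[i][m] * F[j][m] for m in range(2)) + Q[i][j]
              for j in range(2)] for i in range(2)]
        # Gains for a measurement of the first state with noise variance r.
        k1 = M[0][0] / (M[0][0] + r)
        k2 = M[1][0] / (M[0][0] + r)
        gains.append((k1, k2))
        # Updated covariance for the next update point.
        P = [[(1.0 - k1) * M[0][0], (1.0 - k1) * M[0][1]],
             [M[1][0] - k2 * M[0][0], M[1][1] - k2 * M[0][1]]]
    return gains


for k1, k2 in kalman_gains()[:3]:
    print("%.10f  %.10f" % (k1, k2))
```

The state estimates themselves are omitted from the sketch, since they do not enter the gain calculation.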
$JOB  TIME=10
      DOUBLE PRECISION XXX(2),FE(2,2),XJ(2),FEP(2,2),PJ(2,2)
      DOUBLE PRECISION PJ1J(2,2),Q(2,2),HP(2),HPR,KJ(2),ZZ(25),Z,H(2)
      DOUBLE PRECISION XJP1(100),XJP2(100),KJP1(100),KJP2(100)
      DOUBLE PRECISION XJ1JP1(100),XJ1JP2(100)
      DOUBLE PRECISION K,T,B,QQ
      DOUBLE PRECISION A,C,D,E,F,G
      DOUBLE PRECISION YYY(50),RRR(50)
      DOUBLE PRECISION XJ1J(2),KH(2,2),R
      DOUBLE PRECISION EEE(50),WWW(50)
C*******************************************************
C     DATA SECTION
C
      DATA R/1.0D0/
      DATA XJ/0.0D0,0.0D0/
      DATA PJ/1.0D0,0.0D0,0.0D0,1.0D0/
C*******************************************************
C     INPUT DATA FROM MODEL
C
      ZZ(1)  = 0.3D0
      ZZ(2)  = 0.45D0
      ZZ(3)  = 0.5D0
      ZZ(4)  = 0.55D0
      ZZ(5)  = 0.59D0
      ZZ(6)  = 0.6D0
      ZZ(7)  = 0.65D0
      ZZ(8)  = 0.7D0
      ZZ(9)  = 0.74D0
      ZZ(10) = 0.739D0
      ZZ(11) = 0.7384D0
      ZZ(12) = 0.7368D0
      ZZ(13) = 0.73666D0
      ZZ(14) = 0.7325D0
      ZZ(15) = 0.7345D0
      ZZ(16) = 0.745D0
      ZZ(17) = 0.749D0
      ZZ(18) = 0.75D0
      ZZ(19) = 0.7511D0
      ZZ(20) = 0.746D0
      ZZ(21) = 0.75D0
      ZZ(22) = 0.7511D0
      ZZ(22) = 0.7D0
      ZZ(23) = 0.65D0
      ZZ(24) = 0.55D0
      ZZ(25) = 0.5D0
C
C*******************************************************
C     FIND THE FEE AND Q MATRICES
C
      B = 30.0D0
      K = 27.0D0
      T = 0.001D0
      QQ = 1.0D0
      A = (B / (B - 2.0D0 * K))
      C = (1.0D0 - DEXP(-4.0D0 * K * T))
      D = (B/(2.0D0 * K - B))
      E = (1.0D0 - DEXP(-(2.0D0 * K + B) * T))
      F = 2.0D0 * K + B
      G = (2.0D0 * K**2 * D**2 * (1.0D0 - DEXP(-2.0D0 * B * T)))/B
      Q(1,1) = ((K*A**2 * C) + (8.0D0 *K**2 * A * D * E)/F + G) * QQ
      Q(1,2) = ((K * A * C) + (4.0D0 * K**2 * D * E)/F) * QQ
      Q(2,1) = Q(1,2)
      Q(2,2) = (K * C) * QQ
      FE(1,1) = DEXP(-B * T)
      FE(1,2) = (A * DEXP(-2.0D0 * K * T)) + (D * DEXP(-B * T))
      FE(2,1) = 0.0D0
      FE(2,2) = DEXP(-2.0D0 * K * T)
C*******************************************************
C
C     WRITE THE FEE AND Q MATRICES
C
      WRITE(6,50)Q(1,1),Q(1,2),Q(2,1),Q(2,2)
      WRITE(6,50)FE(1,1),FE(1,2),FE(2,1),FE(2,2)
   50 FORMAT(4D26.16)
      I = 1
C
C     COMPUTE THE GAINS AND DATA POINTS
C
      DO 200 J=1,50
      Z = ZZ(I)
      IF(J.GT.25) Z = 0.0D0
C
C     FIND X(J+1|J)
C
      XJ1J(1) = FE(1,1) * XJ(1) + FE(1,2) * XJ(2)
      XJ1J(2) = FE(2,1) * XJ(1) + FE(2,2) * XJ(2)
C
C     FIND P(J+1|J)
C
      FEP(1,1) = FE(1,1) * PJ(1,1) + FE(1,2) * PJ(2,1)
      FEP(1,2) = FE(1,1) * PJ(1,2) + FE(1,2) * PJ(2,2)
      FEP(2,1) = FE(2,1) * PJ(1,1) + FE(2,2) * PJ(2,1)
      FEP(2,2) = FE(2,1) * PJ(1,2) + FE(2,2) * PJ(2,2)
      PJ1J(1,1) = (FEP(1,1) * FE(1,1) + FEP(1,2) * FE(1,2)) + Q(1,1)
      PJ1J(1,2) = (FEP(1,1) * FE(2,1) + FEP(1,2) * FE(2,2)) + Q(1,2)
      PJ1J(2,1) = (FEP(2,1) * FE(1,1) + FEP(2,2) * FE(1,2)) + Q(2,1)
      PJ1J(2,2) = (FEP(2,1) * FE(2,1) + FEP(2,2) * FE(2,2)) + Q(2,2)
C
C     FIND GAIN VALUES
C
      KJ(1) = PJ1J(1,1) / (PJ1J(1,1) + R)
      KJ(2) = PJ1J(2,1) / (PJ1J(1,1) + R)
C
C     KALMAN FILTER EQUATIONS
C
      XJ(1) = XJ1J(1) + KJ(1) * (Z - XJ1J(1))
      XJ(2) = XJ1J(2) + KJ(2) * (Z - XJ1J(1))
C
C     FIND P(J+1) FOR NEXT UPDATE
C
      PJ(1,1) = (1.0D0 - KJ(1)) * PJ1J(1,1)
      PJ(1,2) = (1.0D0 - KJ(1)) * PJ1J(1,2)
      PJ(2,1) = -KJ(2) * PJ1J(1,1) + PJ1J(2,1)
      PJ(2,2) = -KJ(2) * PJ1J(1,2) + PJ1J(2,2)
C*******************************************************
C     PLACE THE DATA IN ARRAYS FOR PRINTING
C
      XJP1(J) = XJ(1)
      XJP2(J) = XJ(2)
      KJP1(J) = KJ(1)
      KJP2(J) = KJ(2)
      XJ1JP1(J) = XJ1J(1)
      XJ1JP2(J) = XJ1J(2)
      YYY(J) = PJ(1,1)
      RRR(J) = PJ(1,2)
      WWW(J) = PJ(2,2)
      EEE(J) = PJ(2,1)
      I = I + 1
      IF(I.GT.25) I = 1
  200 CONTINUE
C*******************************************************
C     PRINT SECTION
C
      WRITE(6,500)
  500 FORMAT(1H1,10X,'XJ(1)',22X,'XJ(2)',22X,'KJ(1)',22X,'KJ(2)',//)
      DO 300 J=1,50
      WRITE(6,100)XJP1(J),XJP2(J),KJP1(J),KJP2(J)
  100 FORMAT(4D26.16)
  300 CONTINUE
      WRITE(6,600)
  600 FORMAT(1H1,10X,'XJ1J(1)',20X,'XJ1J(2)',//)
      DO 400 J=1,50
      WRITE(6,110)XJ1JP1(J),XJ1JP2(J)
  110 FORMAT(2D26.16)
  400 CONTINUE
      WRITE(6,803)
      DO 700 J=1,50
      WRITE(6,801)YYY(J),RRR(J),EEE(J),WWW(J)
  801 FORMAT(4D26.16)
  700 CONTINUE
  803 FORMAT(1H1)
      STOP
      END
C$IBSYS
$ENTRY
          KJ(1)                     KJ(2)
  0.4854414884548314D 00    0.3514123251432413D-01
  0.31646519Y5372664D 00    0.1183072360487317D 00
  0.2369040747046191D 00    0.2396977124904848D 00
  0.1961513961246680D 00    0.3876028792312160D 00
  0.1771852684750831D 00    0.5478401755061794D 00
  0.1718951594639469D 00    0.7047624831807123D 00
  0.1750501295636877D 00    0.8438923069712046D 00
  0.1826520337615844D 00    0.9550718263929637D 00
  0.1916624056709734D 00    0.1034388951887446D 01
  0.2000592067044399D 00    0.1083863653127676D 01
  0.2067942132575313D 00    0.1109363109137870D 01
  0.2115769965469626D 00    0.1118075714831086D 01
  0.2145840141290125D 00    0.1116618015256579D 01
  0.2162018235240588D 00    0.1110104560793060D 01
  0.2168540448005940D 00    0.1101979047233962D 01
  0.2169098908553367D 00    0.1094257554240506D 01
  0.2166503589&00015D 00    0.1087909310655070D 01
  0.2162673823973297D 00    0.1083222855621160D 01
  0.2158781503279750D 00    0.1080097140017313D 01
  0.2155439993168089D 00    0.1078247311142799D 01
  0.2152884314552419D 00    0.1077336375908051D 01
  0.2151119360935653D 00    0.1077050179473900D 01
  0.2150030058272569D 00    0.1077132709722738D 01
  0.2149456012672729D 00    0.1077395921676319D 01
  0.2149236977580024D 00    0.1077714923542722D 01
  0.2149236530902754D 00    0.1078016213121385D 01
  0.2149350916412283D 00    0.1078263980814248D 01
  0.2149508828628735D 00    0.1078447397972143D 01
  0.2149666492031798D 00    0.1078570286388882D 01
  0.2149801007922726D 00    0.1078643556069836D 01
  0.2149903782695532D 00    0.1078680207743594D 01
  0.2149974974276756D 00    0.1078692414749445D 01
  0.2150019294756036D 00    0.1078690122540855D 01
  0.2150043143850873D 00    0.1078680647399358D 01
  0.2150052862021413D 00    0.1078668855684161D 01
  0.215005382670&628D 00    0.1078657619237298D 01
  0.2150050122329170D 00    0.1078648347373325D 01
  0.2150044558538904D 00    0.1078641480113687D 01
  0.2150038867502864D 00    0.1078636888304171D 01
  0.2150033965505648D 00    0.1078634166068704D 01
  0.2150030209507001D 00    0.1078632823945274D 01
  0.2150027613227606D 00    0.1078632401739765D 01
  0.2150026010262317D 00    0.1078632522886f02D 01
  0.2150025165523569D 00    0.1078632910316481D 01
  0.2150024843322726D 00    0.1078633379941569D 01
  0.2150024842622021D 00    0.1078633823459175D 01
  0.2150025011268178D 00    0.1078634188147040D 01
  0.2150025243712576D 00    0.1078634458090151D 01
  0.2150025475750691D 00    0.1078634638938085D 01
  0.2150025673703912D 00    0.1078634746762120D 01
APPENDIX D 
MATRIX COMPUTATIONS OF Q MATRIX 
The evaluation of the terms of the Q matrix begun in Chapter 6 is 
continued in this appendix. The solution begins with the matrix 
operations shown in Figure 29, followed by a term-by-term integration 
of the matrix elements, which leads to the solution for the Q matrix. 
\[
Q = q \int_0^T
\begin{bmatrix}
4K^2 \left( A e^{-2K(t-\tau)} + B e^{-b(t-\tau)} \right)^2 &
4K^2 e^{-2K(t-\tau)} \left( A e^{-2K(t-\tau)} + B e^{-b(t-\tau)} \right) \\
4K^2 e^{-2K(t-\tau)} \left( A e^{-2K(t-\tau)} + B e^{-b(t-\tau)} \right) &
4K^2 e^{-4K(t-\tau)}
\end{bmatrix} d\tau .
\]
Figure 29. Matrix Computations for the Q Matrix 
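Evaluating the first term of the Q matrix at t = T gives 
\[
q_{11} = q \int_0^T 4K^2 \left( A e^{-2K(T-\tau)} + B e^{-b(T-\tau)} \right)^2 d\tau ,
\]
where 
\[
A = \frac{b}{b - 2K} \quad \text{and} \quad B = \frac{b}{2K - b} .
\]
Expanding the square gives three parts. 
Part 1: 
\[
q \, 4K^2 A^2 e^{-4KT} \int_0^T e^{4K\tau} \, d\tau = q K A^2 \left( 1 - e^{-4KT} \right)
\]
Part 2: 
\[
q \, 8K^2 AB \, e^{-(2K+b)T} \int_0^T e^{(2K+b)\tau} \, d\tau
= q \, \frac{8K^2 AB}{2K + b} \left( 1 - e^{-(2K+b)T} \right)
\]
Part 3: 
\[
q \, 4K^2 B^2 e^{-2bT} \int_0^T e^{2b\tau} \, d\tau = q \, \frac{2K^2 B^2}{b} \left( 1 - e^{-2bT} \right)
\]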
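The terms q12 = q21 are evaluated by the equation 
\[
q_{12} = q \int_0^T 4K^2 e^{-2K(t-\tau)} \left( A e^{-2K(t-\tau)} + B e^{-b(t-\tau)} \right) d\tau .
\]
Part 1: 
\[
q \, 4K^2 A \, e^{-4KT} \int_0^T e^{4K\tau} \, d\tau
= q \, \frac{4K^2 A}{4K} \left( 1 - e^{-4KT} \right)
= q K A \left( 1 - e^{-4KT} \right)
\]
Part 2: 
\[
q \, 4K^2 B \, e^{-(2K+b)T} \int_0^T e^{(2K+b)\tau} \, d\tau
= q \, \frac{4K^2 B}{2K + b} \left( 1 - e^{-(2K+b)T} \right)
\]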
Term q22 is determined to be 
\[
q_{22} = q \, 4K^2 e^{-4KT} \int_0^T e^{4K\tau} \, d\tau = q K \left( 1 - e^{-4KT} \right) .
\]
The calculations result in the evaluation of the matrix covariance 
function of the white noise process which drives the system model. This 
is referred to as the Q matrix and is a nonnegative definite matrix of 
the form: 
\[
Q = q
\begin{bmatrix}
K A^2 L + \dfrac{8K^2 AB}{2K + b} J + \dfrac{2K^2 B^2}{b} Y &
K A L + \dfrac{4K^2 B}{2K + b} J \\[2ex]
K A L + \dfrac{4K^2 B}{2K + b} J &
K L
\end{bmatrix}
\]
where 
\[
L = 1 - e^{-4KT}, \quad J = 1 - e^{-(2K+b)T}, \quad Y = 1 - e^{-2bT},
\]
\[
A = b/(b - 2K), \quad B = b/(2K - b) .
\]
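The closed-form entries can be checked numerically. The following is a minimal sketch, not part of the original computations: it evaluates the entries KL, KAL + 4K²B/(2K + b) J, and a (1,1) entry obtained by integrating the squared term of the Figure 29 integrand, and compares them with a Simpson's-rule integration of that integrand at t = T. The function names and the sample values of K, b, T, and q are assumptions chosen only for illustration.

```python
import math

def q_closed_form(K, b, T, q=1.0):
    """Closed-form Q matrix entries (assumes b != 2K so A and B are finite)."""
    A = b / (b - 2.0 * K)
    B = b / (2.0 * K - b)
    L = 1.0 - math.exp(-4.0 * K * T)
    J = 1.0 - math.exp(-(2.0 * K + b) * T)
    Y = 1.0 - math.exp(-2.0 * b * T)
    q11 = q * (K * A * A * L
               + 8.0 * K * K * A * B / (2.0 * K + b) * J
               + 2.0 * K * K * B * B / b * Y)
    q12 = q * (K * A * L + 4.0 * K * K * B / (2.0 * K + b) * J)
    q22 = q * K * L
    return [[q11, q12], [q12, q22]]

def q_numeric(K, b, T, q=1.0, steps=2000):
    """Composite Simpson's rule on the Figure 29 integrand with t = T.

    `steps` must be even; it is a hypothetical tuning parameter.
    """
    A = b / (b - 2.0 * K)
    B = b / (2.0 * K - b)

    def integrand(tau):
        e2k = math.exp(-2.0 * K * (T - tau))
        f = A * e2k + B * math.exp(-b * (T - tau))
        k4 = 4.0 * K * K
        return (k4 * f * f, k4 * e2k * f, k4 * e2k * e2k)

    h = T / steps
    totals = [0.0, 0.0, 0.0]
    for i in range(steps + 1):
        # Simpson weights: 1 at the endpoints, 4 at odd nodes, 2 at even nodes
        w = 1 if i in (0, steps) else (4 if i % 2 else 2)
        values = integrand(i * h)
        for j in range(3):
            totals[j] += w * values[j]
    s11, s12, s22 = (q * h / 3.0 * t for t in totals)
    return [[s11, s12], [s12, s22]]

if __name__ == "__main__":
    # Illustrative parameter values only; they are not taken from the thesis.
    print(q_closed_form(0.5, 3.0, 1.0))
    print(q_numeric(0.5, 3.0, 1.0))
```

Agreement between the two results for several parameter choices gives some confidence that the term-by-term integration was carried through consistently.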
APPENDIX E 
DESIGN STEPS FOR MULTI-PHASE PROCESSOR 
To evaluate the optimal design of the multi-phase array processor, 
certain data on each processor must be obtained: 
1. Cost of each type of processor considered. 
2. Time necessary to complete one computation. 
3. Power (in watts) used to operate each type of processor. 
4. Number of packages that compose each processor and number of 
pins used on each package. 
Once the hardware data are acquired, the linear program equations are 
created. Let $C_i$ = cost of processor $P_i$, i = 1, 2, 3, ..., N, with 
\[
C_i \ge C_{i+1} .
\]
Let $T_c$ equal the total time allowed for matrix computations and 
$T_{pi}$ equal the cycle time of each processor $P_i$. 
\[
T_i = \text{largest integer} \, (T_c / T_{pi}), \quad i = 1, 2, 3, \ldots, N .
\]
The time equation will be in the following form: 
\[
T_1 P_1 + T_2 P_2 + \cdots + T_N P_N \ge (\text{number of terms in matrix}) .
\]
Let $P_i'$ equal the number of processors of type $P_i$ necessary to 
compute the problem if only type $P_i$ processors are employed. 
\[
P_i' = (\text{number of elements in matrix}) / T_i .
\]
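The two quantities just defined can be computed directly from the collected data. The sketch below is illustrative only: the function names are hypothetical, times are given as integers in a common unit to avoid floating-point rounding in the floor operation, and rounding the single-type processor count up to a whole processor is an assumption that the text does not state.

```python
def computations_per_processor(total_time, cycle_times):
    """T_i = largest integer not exceeding T_c / T_pi for each processor type.

    Times are integers in a common unit (for example microseconds).
    """
    counts = []
    for t_p in cycle_times:
        if t_p <= 0:
            raise ValueError("cycle times must be positive")
        counts.append(total_time // t_p)
    return counts

def processors_if_single_type(n_elements, t_counts):
    """P_i' = (number of elements in matrix) / T_i.

    Rounded up to a whole number of processors (an assumption).
    """
    result = []
    for t_i in t_counts:
        if t_i == 0:
            raise ValueError("processor type cannot finish one computation in time")
        result.append(-(-n_elements // t_i))  # integer ceiling division
    return result

if __name__ == "__main__":
    # Illustrative values only: T_c = 100 and cycle times 2, 5, 12 (same unit),
    # with a 10 x 10 matrix giving 100 terms.
    t_counts = computations_per_processor(100, [2, 5, 12])
    print(t_counts, processors_if_single_type(100, t_counts))
```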
The resulting linear program is of the form: 
Minimize: 
< z 
Constraints: 
The solution to the linear program lies in the region above the time 
line and below the power and area lines. Before attempting to obtain the 
optimal solution, the solution region should be examined to determine 
whether it admits a feasible solution. At this point the solution region 
can be reduced or enlarged by altering the values of $W_T$ and $U_T$. 
This capability facilitates the search for the integer linear program 
solution by effectively reducing the search domain. 
The solution to the integer linear program is generated using 
available computer software and computer systems. The technique uses a 
branch and bound algorithm based on the Land and Doig (32) method. 
Details of the algorithm are covered in Appendix A. The end result of 
the linear program is a practical circuit, in optimal form, for computing 
a vector, matrix product. Figure 30 illustrates the steps in the design 
sequence of the multi-phased array processor. 
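The search can be illustrated with a small depth-first branch and bound. This is a sketch under stated assumptions and is not the Land and Doig procedure itself, which branches on fractional variables of a linear programming relaxation. It assumes the program has the form: minimize the total cost of the selected processors, subject to the time equation and to upper limits on total power and total package count. The per-processor power and package figures, the limits `power_max` and `area_max`, and all numeric values are hypothetical.

```python
import math

def branch_and_bound(cost, rate, power, area, demand, power_max, area_max):
    """Choose integer processor counts that minimize total cost.

    rate[i] plays the role of T_i (terms one processor of type i completes);
    `demand` is the number of terms in the matrix. Rates must be positive.
    """
    n = len(cost)
    best = {"cost": math.inf, "counts": None}

    def lower_bound(i, remaining):
        # Relaxed bound: cover the remaining terms at the cheapest
        # cost-per-term among the undecided types, ignoring power and area.
        if remaining <= 0:
            return 0.0
        if i >= n:
            return math.inf
        return remaining * min(cost[j] / rate[j] for j in range(i, n))

    def search(i, counts, c, remaining, p, a):
        if c + lower_bound(i, remaining) >= best["cost"]:
            return  # prune: this branch cannot beat the incumbent
        if remaining <= 0:
            best["cost"] = c
            best["counts"] = counts + [0] * (n - i)
            return
        max_count = math.ceil(remaining / rate[i])
        for k in range(max_count, -1, -1):
            if p + k * power[i] > power_max or a + k * area[i] > area_max:
                continue  # violates the power or area limit
            search(i + 1, counts + [k], c + k * cost[i],
                   remaining - k * rate[i],
                   p + k * power[i], a + k * area[i])

    search(0, [], 0.0, demand, 0.0, 0)
    return best["cost"], best["counts"]

if __name__ == "__main__":
    # Hypothetical data for three processor types, costs in decreasing order.
    print(branch_and_bound(cost=[40, 15, 5], rate=[50, 20, 8],
                           power=[3.0, 1.5, 0.5], area=[4, 2, 1],
                           demand=100, power_max=7.0, area_max=10))
```

Tightening or relaxing `power_max` and `area_max` corresponds to shrinking or enlarging the solution region before the search, as described above.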
OBTAIN DATA: 
1. COST OF EACH TYPE OF PROCESSOR 
2. CYCLE TIME OF EACH PROCESSOR 
3. POWER (IN WATTS) USED BY EACH PROCESSOR 
4. NUMBER OF PACKAGES THAT COMPOSE PROCESSORS 
↓ 
LET Tc = TOTAL TIME TO DATA OUTPUT 
↓ 
COMPUTE THE PARAMETERS OF CONSTRAINT EQUATIONS 
↓ 
COMPUTE THE UPPER AND LOWER LIMITS OF THE 
POWER AND SIZE EQUATIONS TO BE UTILIZED 
↓ 
PLOT LINEAR EQUATIONS IF POSSIBLE AND ATTEMPT 
TO REDUCE THE SOLUTION AREA IF POSSIBLE 
↓ 
OBTAIN THE OPTIMAL SOLUTION (APPENDIX B) 
↓ 
DESIGN THE CIRCUITS USING THE RESULTS OF 
CHAPTER 5 AND THE OPTIMAL SOLUTION DATA 
Figure 30. Multi-phased Processor Design Flow Chart 
VITA 
Larry Gene Stotts 
Candidate for the Degree of 
Doctor of Philosophy 
Thesis: OPTIMAL DISTRIBUTED MICROPROCESSOR ARCHITECTURE USING MULTI-PHASE 
PROCESSING TO PERFORM A VECTOR, MATRIX MULTIPLICATION 
Major Field: Electrical Engineering 
Biographical: 
Personal Data: Born in Pawhuska, Oklahoma, September 7, 1949, the 
son of Mr. and Mrs. E. E. Stotts. 
Education: Graduated from Ponca City High School, Ponca City, 
Oklahoma, in May, 1967; received the Bachelor of Science degree in 
Electrical Engineering from Oklahoma State University, 
Stillwater, Oklahoma, in May, 1972; received the Master of Science 
degree in Electrical Engineering from Oklahoma State University, 
Stillwater, Oklahoma, in May, 1977; completed requirements for 
the Doctor of Philosophy degree at Oklahoma State University, 
Stillwater, Oklahoma, in July, 1979. 
Professional Experience: Communication officer, U.S. Army Signal 
Corps, May, 1972, to May, 1976; Instructor, Electrical 
Engineering, Oklahoma State University, Stillwater, Oklahoma, 
1978-1979. 
Professional Organizations: Member of the Institute of Electrical 
and Electronics Engineers. 
