Array processor architecture by Lundstrom, Stephen F. et al.
United States Patent [19]
Barnes et al.
[11] 4,412,303
[45] Oct. 25, 1983
[54] ARRAY PROCESSOR ARCHITECTURE
[75] Inventors: George H. Barnes; Stephen F. Lundstrom, both of Wayne;
     Philip E. Shafer, Holmes, all of Pa.
[73] Assignee: Burroughs Corporation, Detroit, Mich.
[21] Appl. No.: 97,191
[22] Filed: Nov. 26, 1979
[51] Int. Cl.3 ....................... G06F 15/16; G06F 15/20
[52] U.S. Cl. .................................................... 364/900
[58] Field of Search ... 364/200 MS File, 900 MS File
[56] References Cited
U.S. PATENT DOCUMENTS 
3,537,074 10/1970 Stokes et al. ........................ 364/200 
4,051,551 9/1977 Lawrie et al. ....................... 364/200 
4,101,960 7/1978 Stokes et al. ........................ 364/200 
OTHER PUBLICATIONS
Concurrency: Single Processor Systems by Siewiorek,
Bell and Newell from Computer Structures
(McGraw-Hill, Inc.), 1980.
Illiac IV System, Proc. IEEE, Apr. 1972, pp. 369-388.
STARAN, Proceedings Fall Joint Computer Conference
1972, pp. 229-241.
STARAN, Proceedings National Computer Conference
1974, pp. 405-410.
C.mmp/Hydra Project, Carnegie Mellon University, 1971.
Primary Examiner-Gareth D. Shaw 
Assistant Examiner-John G. Mills 
Attorney, Agent, or Firm-Leonard C. Brenner; M. L.
Young; K. R. Peterson
[57] ABSTRACT
A high speed parallel array data processing architecture
fashioned under a computational envelope approach
includes a data base memory for secondary storage of
programs and data, and a plurality of memory modules
interconnected to a plurality of processing modules by a
connection network of the Omega type. Programs
and data are fed from the data base memory to the
plurality of memory modules, and from thence the
programs are fed through the connection network to the
array of processors (one copy of each program for each
processor). Execution of the programs occurs with the
processors operating normally quite independently of
each other in a multiprocessing fashion. For data dependent
operations and other suitable operations, all processors
are instructed to finish one given task or program
branch before all are instructed to proceed in
parallel processing fashion on the next instruction. Even
when functioning in the parallel processing mode, however,
the processors are not locked-step but execute
their own copy of the program individually unless or
until another overall processor array synchronization
instruction is issued.
8 Claims, 9 Drawing Figures 
[Front-page drawing omitted. Surviving labels: synchronization, GO,
array, "I got here", PROC. through PROC. 511, OR, to/from support
processor system, communication register, CN to CN (access to EM),
buffer, decoder.]
[Drawing sheets 1 through 6 of U.S. Patent 4,412,303 (Oct. 25, 1983)
omitted; only fragmentary labels survive extraction.
Sheet 1: extended memory modules EM0, EM1 through EM520; processors
PROC. 0, PROC. 1 through PROC. 511; to/from support memory.
Sheet 2: Fig. 2 and Fig. 3; connection network buffer (CNB), execution
unit (EU), connection network (CN), coordinator, data base memory and
controller, to/from support processor system.
Sheet 3: memory storage unit, memory address registers (MAR) for
addresses from the DBM and from the processors/CN, one-word buffer,
parallel to byte-serial data buffer.
Sheet 5: two 64K-word memories, data bus (55 bits), CCD array, data
register (440 bits), to/from file memory, to/from extended memory
module.]
ARRAY PROCESSOR ARCHITECTURE

The invention described herein was made in the performance
of work under NASA Contract No. NAS 2-9897 and is subject to
the provisions of Section 305 of the National Aeronautics and
Space Act of 1958 (72 Stat. 435, 42 U.S.C. 2457).

RELATED U.S. PATENT APPLICATIONS
U.S. patent applications directly or indirectly related
to the subject application are as follows: Ser. No.
097,419, filed Nov. 26, 1979 by George H. Barnes et al
and titled Array Processor Architecture Connection
Network.

BACKGROUND AND OBJECTS OF THE INVENTION
This invention relates generally to large scale data
processing systems and more particularly to systems
employing a plurality of processors operating more or
less simultaneously in order to reduce overall program
execution time.
In the development of digital computers a most important
design goal has always been to maximize their
operating speed, i.e., the amount of data that can be
processed in a unit of time. It has become increasingly
apparent in recent times that two important limiting
conditions exist within the present framework of computer
design. These are the limits of component speed
and of serial machine organization. To overstep these
limitations two different types of parallel operating
systems have been developed.
First, multiprocessing systems have been developed
wherein a number of quite independent processors have
been linked together to operate in parallel on differing
portions of a program or job in order to speed execution
of that program or job. Frequently, the processors are
linked together in a network loop or similar fashion,
thus greatly slowing the cooperation between processors.
When the processors are linked together by a
parallel and much faster network such as a crossbar
network, the network control mechanism and the cost
and reliability of the network quickly become unwieldy
for a reasonably large number of processors.
Second, high speed parallel locked-step processing
systems have been developed providing an array of
processing elements under the control of a single control
unit.
As speed requirements of computation have continued
to increase, systems employing greater numbers of
parallel memory modules have been developed. One
such system has in the order of 64 parallel memories, see
U.S. Pat. No. 3,537,074, issued Oct. 27, 1970, to R. A.
Stokes et al, and assigned to the assignee of the present
invention. However, parallel processors have not been
without their own problems.
Primarily, parallel processors are often so far removed
from the conventional scalar processors that
they are hard to program. Secondarily, parallel processors
are fashioned to operate efficiently with vectorized
data but are quite inefficient operating upon scalar data.
Finally, parallel processors, being found operating in a
locked-step fashion in prior art, force all processors in
the parallel array thereof to perform in synchronization
whether or not such operation is needed in all processors.
The manner of difficulty in programming the parallel
array has been greatly eased by the incorporation and
use of the computational envelope approach as disclosed
in U.S. Pat. No. 4,101,960, issued July 18, 1978,
in the name of Stokes et al, and assigned to the assignee
of the present invention. Briefly, in the computational
envelope approach a host or support processor of a
general processing variety (such as a Burroughs B7800)
functions as an I/O controller and user interface. Special
purpose jobs are transferred in their entirety (program
and data) to a large high speed secondary storage
system or data base memory and from thence to the
array memory modules and array processor for processing.
During the special purpose processing period the
front end support or host processor is freed for other
processing jobs. Once the complete special purpose job
or task is executed by the array processors, the resultants
therefrom are returned through the array memories
and the data base memory to the support processor
for output to the user.
It is an object of the present invention to provide a
processing array functioning in the computational
envelope mode of operation.
It is another object of the present invention to provide
an array of processors which can function efficiently
and effectively both in multiprocessing and parallel
processing.
It is yet another object of the present invention to
provide a method and apparatus for quickly and efficiently
synchronizing an array of independent processors
to begin a task in parallel.

SUMMARY OF THE INVENTION
In carrying out the above and other objects of this
invention, there is provided a support processor, a parallel
processing array, a coordinator for issuing special
commands to the array and a large data base memory
having high speed I/O paths to the support processor
and to the memory modules of the parallel processing
array.
In operation, the support processor is programmed in
a high level language and transfers complete parallel
tasks to the data base memory whereupon the support
processor is freed to perform general purpose or other
tasks. Upon parallel task completion, complete files are
transferred back to the front end processor from the data
base memory.
The parallel processing array efficiently processes
vector and other data elements in a parallel but not a
locked-step fashion. When required, all processors may
be brought to a halt upon completing one instruction
before proceeding all in parallel on the next instruction.
The parallel processors are connected to an array of
memory modules through an Omega type connection
network.
Various other objects and advantages and features of
this invention will become more fully apparent in the
following specification with its appended claims and
accompanying drawings wherein:

BRIEF DESCRIPTION OF THE DRAWING
FIG. 1 is a block diagram showing the environmental
architecture of the present invention;
FIG. 2 is a diagram of the major component parts of
a processor used in a processing array in the present
invention;
FIG. 3 depicts the arrangement of a coordinator in
the architecture of the present invention;
FIG. 4 is a detailed diagram of the coordinator of
FIG. 3;
FIG. 5 is a diagram of the major component parts of
a memory module used in an extended memory module
array in the present invention;
FIG. 6 is a diagram of a partial portion of the Omega-type
connection network used to interpose the processors
of FIG. 2 and the memory modules of FIG. 5;
FIG. 7 is a logic diagram of a 2x2 crossbar switching
element used in the Omega network of FIG. 6;
FIG. 8 is a circuit diagram of a control logic circuit
used in the crossbar switching element of FIG. 7; and
FIG. 9 is a logic diagram of a data base memory used
to interface in a computational envelope architectural
manner the parallel and multiprocessing array of the
present invention and the support processor which provides
programs, data and I/O communication with the
user.
DESCRIPTION OF THE PREFERRED EMBODIMENT
The present invention, see FIG. 1, comprises five 
major component elements; namely, the processor array 
11, the extended memory module array 13, the connec- 
tion network 15 interconnecting the processor array 11 
with the extended memory module array 13, the data 
base memory 17 with data base memory controller 19 
for staging jobs to be scheduled and for high-speed 
input/output buffering of jobs in execution, and the 
coordinator 21 used to synchronize the processor array 
11 and coordinate data flow through the connection 
network 15. 
In operation, all data and programs for a run are first
loaded into the data base memory 17 prior to the beginning
of the run. This loading is initiated by a Support
Processor System (not shown), which functions as a user
and input/output interface and transfers data and programs,
under control of a user, from a secondary storage file
memory (not shown). The use of such a Support Processor
System is detailed in the above-cited U.S. Pat.
Nos. 3,537,074 and 4,101,960.
As the run is initiated, the data base memory controller
19 transfers code files from the data base memory 17
to the extended memory module array 13. Then the data
base memory controller 19 transfers the processor array
11 code files thereto and the necessary job data to the
extended memory module array 13.
Once initiated, the present invention is capable of parallel
execution in a manner similar to the lock-step array
machines disclosed in U.S. Pat. Nos. 3,537,074 and
4,101,960. Simple programs (having a copy thereof
resident in each processor of the processor array 11),
with no data-dependent branching, can be so executed.
However, the present invention is not limited to parallel
mode operation since it can also function in the manner
of a conventional multiprocessor and, in fact, may be a
commercially available general purpose processor.
Thus, as will be detailed hereinafter, the present invention
performs essentially just as efficiently whether the
data is arranged in the form of vectors or not. The
processor array 11 can function in the lock-step fashion
for vector processing and in independent scalar fashion
for multiprocessing operation.
A simple but illustrative example of the operation of
the present invention involves the following vector
calculation:
1. A + B = C
2. If C greater than 0 do subroutine W to calculate Z
3. If C equals 0 do subroutine X to calculate Z
4. If C less than 0 do subroutine Y to calculate Z
5. Calculate D = A divided by Z
With vector notation it is appreciated that A represents
the elements a_i from a_1 to a_n, wherein n equals the
number of elements in the vector. The same relationship
holds for vectors B, C, D and Z. Also it is appreciated
that the elements of vectors are individually stored in
memory modules of the extended memory module
array 13 and that the elements are fetched therefrom
and operated thereupon individually by the individual
processors of the processor array 11.
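The five steps can be sketched for a single element index, as one processor of the array would carry them out. This is a minimal illustration only: the text does not say what subroutines W, X and Y compute, so `sub_w`, `sub_x` and `sub_y` below are hypothetical placeholders, and step 5 is taken here to be the element-wise division of A by Z, the vector computed in steps 2 to 4.

```python
# Illustrative sketch only: sub_w, sub_x and sub_y are hypothetical stand-ins
# for subroutines W, X and Y, whose contents the text does not specify.

def sub_w(c):
    return c + 1.0      # placeholder for subroutine W (used when c > 0)

def sub_x(c):
    return 1.0          # placeholder for subroutine X (used when c == 0)

def sub_y(c):
    return c - 1.0      # placeholder for subroutine Y (used when c < 0)

def process_element(a, b):
    """Steps 1 to 5 for one element pair (a_i, b_i); returns (c_i, z_i, d_i)."""
    c = a + b                     # step 1
    if c > 0:                     # step 2
        z = sub_w(c)
    elif c == 0:                  # step 3
        z = sub_x(c)
    else:                         # step 4
        z = sub_y(c)
    d = a / z                     # step 5
    return c, z, d

# Each list position plays the role of one processor's element.
A = [1.0, -2.0, 3.0]
B = [2.0, 2.0, -5.0]
results = [process_element(a, b) for a, b in zip(A, B)]
```

Each element takes a different branch in this small run, which is the situation in which a locked-step array would idle processors and the independent processors described here would not.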
The elements of the vectors are loaded into the memory
modules according to some particular mapping
scheme. The simplest loading scheme would be to load
the first vector element into the first memory module,
the second vector element into the second memory
module, etc. However, such a simple mapping does not
lead to efficient parallel processing for many vector
operations. Hence, more complex mapping schemes
have been developed such as disclosed in U.S. Pat. No.
4,051,551, entitled “Multidimensional Parallel Access
Computer Memory System”, issued Sept. 27, 1977 in
the names of Lawrie et al, and assigned to the assignee
of the present invention. The mapping scheme disclosed
therein is incorporated in the “Scientific Processor” of
U.S. Pat. No. 4,101,960, issued July 18, 1978, in the
name of Stokes et al and assigned to the assignee of the
present invention.
The actual mapping scheme selected is relatively
unimportant to the basic operation of the present invention.
It is important, however, that the rule of the mapping
scheme be stored in processor array 11 so that each
processor therein can calculate by that rule where the
vector element that must be fetched is stored. For example,
if the first processor is always to fetch from the first
memory module 13, and the second processor from the
second memory module 13, etc., the instruction stored in
each processor would express “fetch element specified
from memory i” where “i” would be the processor 29
number. As will be detailed later, each processor 29 has
wired in, preferably in binary format, its own processor
number and each memory module likewise. It will also
be appreciated that each stored element is identified as
being stored in a particular memory module 13 at a
particular storage location therein. The exact storage
location is a direct function of the mapping used to store
the element. This mapping is known and stored as a
subroutine in each processor 29 to determine from
whence it is to fetch its particular element.
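A minimal sketch of such a mapping subroutine follows, assuming the simplest interleaved rule: element k is stored in module k mod M at storage location k div M. This rule is an assumption made for illustration and is not the mapping of U.S. Pat. No. 4,051,551; the names `locate` and `N_MODULES` are hypothetical, and the module count of 521 is the figure given for the preferred embodiment.

```python
N_MODULES = 521  # memory modules in the preferred embodiment

def locate(element_index, n_modules=N_MODULES):
    """Return (module number, storage location) for a vector element.

    Assumed rule: consecutive elements go to consecutive modules,
    wrapping around after the last module.
    """
    module = element_index % n_modules
    location = element_index // n_modules
    return module, location

# Each processor evaluates the same rule with its own element index,
# so no shared lookup table has to be consulted.
requests = [locate(i) for i in range(512)]
```

Because the rule is pure arithmetic on the element index, every processor can hold an identical copy of it and still arrive at a different module.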
As will also be detailed later, the connection network
15 can be set to “broadcast” to all processors 29 simultaneously.
Thus in one transfer it can store a copy of
mapping subroutines and other program instructions in
each processor 29.
Following loading of the program instructions the
execution of the problem or job to be solved begins. In
the example above, each processor 29 fetches its element
a_i of vector A and its element b_i of vector B to
calculate by addition its element c_i of vector C. Since
the processors are all doing the same thing, the above
fetching and calculating occurs nearly in parallel.
However, each processor is now storing a value c_i
which is either greater than, equal to, or less than zero.
It performs the appropriate subroutine to calculate its
element z_i of Z. Thus in sharp contrast to prior art
locked-step processors, the processors 29 of the present
invention are not all proceeding on the same branch
instruction simultaneously.
Assuming now that the entire vector Z must be determined
before the next step can be executed, each processor
calculating a vector element z_i issues a flag indicating
“I got here” or that it has successfully completed its
last instruction, that of calculating z_i. That processor 29
is halted. The Coordinator 21 monitors all processors 29
that are calculating a vector element z_i. When all such
processors 29 issue their “I got here” flag, the Coordinator
21 issues a “GO” signal and all processors 29 then
begin the next instruction (in the present example, that
of calculating vector element d_i of D). Thus, the processors
29 can function independently and effectively but
still be locked in parallel by a single instruction when
such parallel operation is called for.
Following calculation of the elements d_i of the vector
D, the vector D is fed back through the connection
network 15 to the extended memory module 13 and
from thence through the data base memory 17 to the
user.
The above simplified illustrative example of the operation
of the present invention presented an overview of
the structure and function of the present invention. A
fuller understanding may be derived from a closer examination
of its component parts.
The processor array 11 consists of 512 independent
like processors 29 noted as Processor 0 through Processor
511 in FIG. 1. Each processor 29 includes a connection
network buffer 23, an execution unit 25 and a local
processor memory 27. The number of processors 29 in
the processor array 11 is selected in view of the power
of each processor 29 and the complexity of the overall
tasks being executed by the processor array 11. Alternate
embodiments may be fabricated employing more
or less than 512 processors 29 having less or more processing
capability than detailed below.

Processor
In the array of 512 (516 counting spares) processors
29, each processor is identical to all others except for an
internal identifying number which is hardwired into
each processor (by 10 lines representing in binary code
the number of the processor) in the preferred embodiment
but may be entered by firmware or software in
alternate embodiments. It is of importance only that
when data is to flow to or from processor number 014
(for example) that a single processor 29 is identified as
being number 014.
Each processor 29 is in itself a conventional FORTRAN
processor functioning to execute integer and
floating-point operations on data variables. Each processor
29, see FIG. 2, is partitioned into three parts;
namely, the Execution Unit 25, the Processor Memory
27, and the Connection Network Buffer 23.
The Execution Unit 25 is the logic part of the Processor
29. The Execution Unit 25 executes code contained
in its Processor Memory 27, fetches and stores data
through the Connection Network 15 via its Connection
Network Buffer 23. As will be detailed hereinafter, the
Execution Unit 25 also accepts commands from the
Coordinator 21 via the Command Line 31 and synchronization
via the Synchronization Line 33. The Execution
Unit 25 executes instructions stored in its Processor
Memory 27 by addressing same through Address Path
37 to fetch instruction code on Fetch Line Path 39 or to
store instruction code on Store Path 41. Likewise, data
is transferred through the connection network 15 via
the connection network buffer 23 through the Address
Path 43, the store data path 45, and the fetch data path
47.
Associated with the execution unit 25 is an enable
flip-flop 34 and an I-got-here flip-flop 36. A flag bit programmed
into the code executed by the execution unit
25 indicates when the execution unit 25 is executing an
instruction or task which must be concluded by itself
and before all other like execution units 25 working on
the same data proceed with further instructions. This
flag bit sets enable flip-flop 34 thereby raising the enable
output line 44 thereof and deactivating the non-enable
line 38 thereof which is fed to the coordinator 21 to
indicate which processors 29 are not enabled. When the
instruction or task is completed by the execution unit 25
(i.e., the calculation of “Z” in the example above), a bit
is sent out to set the “I-got-here” flip-flop 36 which
raises its I-got-here output line 40 to the coordinator 21.
The coordinator 21 thereupon issues a command via its
command and synchronization lines 31 and 33 to halt
the processor 29 until the coordinator 21 receives an
I-got-here signal from all enabled processors 29. Then a
“GO” signal is issued from the coordinator 21 through
line 42 and AND gate 44 to reset the I-got-here flip-flop
36 and the enable flip-flop 34. The Coordinator 21 also
releases its halt commands through lines 31 and 33 and
all processors 29 begin in parallel the execution of the
next task or instruction which usually involves the
fetching of freshly created or modified data from the
extended memory array 13 through the connection
network 15.
The Processor Memory 27 holds instructions for
execution by the Execution Unit 25 and the data to be
fetched in response to the instructions held. Preferably,
the Processor Memory 27 is sufficient for storage of
over 32,000 words of 48 bits plus 7 bits of error code
each. Data, address and control communication is
solely with the Execution Unit 25.
The Connection Network Buffer 23 functions as an
asynchronous interface with the Connection Network
15 to decouple the Processor 29 from the access delays
of the Connection Network 15 and the Extended Memory
Module Array 13 (FIG. 1). The Connection Network
Buffer 23 communicates basically three items:
the number of a particular Memory Module in the Extended
Memory Module Array 13 with which communication
is desired plus the operation code (i.e., fetch,
store, etc.), the address within the particular Memory
Module, and one word of data. The presence of an
Extended Memory Module number in the Connection
Network Buffer 23 functions as a request for that Extended
Memory Module. An “acknowledge” signal is
returned via the Connection Network 15 from the Extended
Memory Module selected to indicate the success
of the request.
Each and every Connection Network Buffer 23 is
clock-synchronized with the Connection Network 15 to
eliminate time races therethrough.

Coordinator
The Coordinator 21, as seen in FIG. 3, in essence
performs two major functions: first, it communicates
with the Support Processor primarily to load jobs into
and results out of the data base memory 19; second, it
communicates with the processors 29 in order to provide
synchronization and control when all processors
29 are to operate in parallel on a particular piece of data
or at a particular step in a program.
The coordinator’s function in communicating with
the Support Processor is relatively simple under the
computational envelope approach such as detailed, for
example, in U.S. Pat. No. 4,101,960. Under the computational
envelope approach all data and programs for at
least one job are transferred from the Support Processor
to the Data Base Memory 17. Thereafter, the Support
Processor is freed. Programs and data are transferred to
the Extended Memory Module Array 13 and eventually
to the Processor Array 11 for processing. Upon completion
of the processing, results are communicated back to
the Data Base Memory 17 and finally back to the Support
Processor.
Communication is maintained between the Coordina- 
tor 21 and the Support Processor via a Communication 
Register 49 and with the Data Base Memory Controller 
19 through an I/O Buffer and Decoder 51, see FIG. 4. 
A Connection Network Buffer 23 identical in structure
and function to the Connection Network Buffer 23
of each Processor 29 (see FIG. 2) permits the Coordinator
21 to communicate with the Extended Memory
Module Array 13 through the Connection Network 15
as easily as if it were a Processor 29. Likewise, a Connection
Network Port 53 permits the Coordinator 21 to
communicate with the Processors 29 through the Connection
Network 15 as easily as if it were a memory
module in the Extended Memory Module Array 13.
Other important links to the Coordinator 21 from the
Processors 29 are via the “I got here” lines 40 and the
NOT Enabled lines 38 from each Processor 29. By
ORing the lines 38 and 40 individually for each Processor
29 in an OR circuit 59 and combining all OR circuit
59 output lines 61 through an AND circuit 63 an output
65 is obtained which signifies that every enabled Processor
29 has finished its current processing task and
raised its “I got here” line 40. The output line 65 is fed
through Control Logic 67 to issue a “GO” signal on GO
line 42 to release all enabled Processors 29 and allow
them to continue processing in parallel. The control
logic 67 also provides on line 33 the synchronization for
the Processors 29 to provide proper timing between the
Processors 29 and the Connection Network 15. Further,
the Control Logic 67 provides standard communication
control on a Communication Bus 73 between the Communication
Register 49, I/O Buffer and Decoder 51,
Connection Network Port 53, and the Connection
Network Buffer 23.
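The OR/AND arrangement described above amounts to a simple Boolean reduction over the processors. A minimal sketch follows, assuming each signal line is modelled as a Python bool; the function name `go_ready` is hypothetical, and the model ignores timing and the Control Logic 67 that actually issues the “GO” signal.

```python
def go_ready(not_enabled, got_here):
    """Model of the OR circuits 59 feeding the AND circuit 63.

    not_enabled[i] stands for the NOT Enabled line 38 of processor i,
    got_here[i] for its "I got here" line 40.
    Returns True when output 65 would be raised.
    """
    or_outputs = [ne or gh for ne, gh in zip(not_enabled, got_here)]
    return all(or_outputs)

# Processors 0 and 2 take part in the synchronized step; processor 1 does not.
not_enabled = [False, True, False]
waiting = go_ready(not_enabled, [True, False, False])   # processor 2 not done yet
released = go_ready(not_enabled, [True, False, True])   # every enabled one is done
```

Treating a processor that is not enabled as if it had already arrived is what lets a subset of the array synchronize without stalling on processors that are busy with unrelated work.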
Extended Memory Module
The Extended Memory Module 13 is the “main” 
memory of the present invention in that it holds the data 
base for the program during program execution. Temporary
variables, or work space, can be held in either
the Extended Memory Module 13 or the Processor
Memory 27 (see FIG. 1), as appropriate to the problem.
All I/O to and from the present invention is to and from 
the Extended Memory Module 13 via the Data Base 
Memory 19 (see FIG. 1). Control of the Extended 
Memory Module 13 is from two sources; the first being 
instructions transmitted over the Connection Network 
15 and the second being from the Data Base Memory 
Controller 19 (see FIG. 1) which handles the transfers 
between the Data Base Memory 19 and the Extended 
Memory Module 13. 
In the preferred embodiment of the present invention
there are 521 individual memory modules in the Extended
Memory Module 13. The number 521 is chosen
because it is a prime number larger than the number
(512) of Processors 29. The combination of 521 Memory
Modules 13 with 512 Processors 29 facilitates vector
processing as detailed in U.S. Pat. Nos. 4,051,551 and
4,101,960.
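One way to see why a prime module count helps is sketched below, under the assumption that element k is stored in module k mod M (an illustrative rule, not necessarily the mapping of the cited patents). With M = 521, any stride that is not a multiple of 521 sends 512 simultaneous requests to 512 different modules, whereas with M = 512 an even stride makes requests collide. The function name `modules_hit` is hypothetical.

```python
def modules_hit(stride, n_requests, n_modules):
    """Count the distinct modules addressed by n_requests strided accesses."""
    return len({(k * stride) % n_modules for k in range(n_requests)})

# 512 processors each fetching every 2nd element of a vector:
with_prime = modules_hit(stride=2, n_requests=512, n_modules=521)
with_power_of_two = modules_hit(stride=2, n_requests=512, n_modules=512)
```

In this model the prime count gives 512 distinct modules for the stride-2 access, while the power-of-two count gives only 256, so half of the requests would have to wait.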
Each memory module 13 is identical to all others
except that it has its own module number (i.e., 0-520)
associated with it, preferably in a hardwired binary
coded form. The purpose of having a memory module
13 numbered is to provide identification for addressing.
Storage locations within each memory module 13 are
accessed by the Memory Module number and storage
location within that memory module, which together
comprise in essence the total address.
Each memory in the Extended Memory Module 13 is
conventional in that it includes basic storage and buffering
for addressing and data, see FIG. 5. Basic storage is
provided in the preferred embodiment by a Memory
Storage Unit 73 sufficient to store 64,000 words each
having 55 bits (48 data bits and 7 checking bits). High
speed solid state storage is preferred and may be implemented
by paralleling four 16K RAM memory chips.
Standard address registers are also provided: a first
Memory Address Register 75 for addressing from the
Data Base Memory 17 and a second Memory Address
Register 77 for addressing from the Processors 29 via
the Connection Network 15. Data buffering is provided
by a one-word buffer 79 for data communication with
the Data Base Memory 17 and a parallel-to-byte-serial
buffer 81 for communication through the Connection
Network 15. Byte communication rather than word
communication is handled through the Connection Network
15 to minimize the number of data paths and
switching paths required therethrough. Alternate embodiments
may range, for example, from bit communication,
which is simple but slow, to word communication,
which is relatively faster but likewise quite expensive
and massive in hardware implementation.
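A minimal sketch of the parallel-to-byte-serial idea follows, assuming that a 55-bit word crosses the network as five 11-bit frames. The frame width is taken from the “11 bit frame” mentioned for the LOADEM command; the count of five frames and the order (most significant frame first) are assumptions, and the function names are hypothetical.

```python
WORD_BITS = 55    # 48 data bits + 7 checking bits
FRAME_BITS = 11   # assumed frame width
FRAMES_PER_WORD = WORD_BITS // FRAME_BITS   # 5 under these assumptions

def to_frames(word):
    """Split a 55-bit word into frames, most significant frame first."""
    mask = (1 << FRAME_BITS) - 1
    return [(word >> (FRAME_BITS * i)) & mask
            for i in reversed(range(FRAMES_PER_WORD))]

def from_frames(frames):
    """Reassemble a word from frames produced by to_frames."""
    word = 0
    for frame in frames:
        word = (word << FRAME_BITS) | frame
    return word

word = (1 << 54) | 0x5A5      # top bit set plus a small low-order pattern
frames = to_frames(word)
```

Sending frames instead of whole words trades transfer time for a narrower path through every switch, which is the compromise the paragraph above describes.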
Communication through the Connection Network 15
and the Extended Memory Module 13 is straightforward.
A strobe signal and accompanying address field
indicates the arrival of a request for a particular Extended
Memory Module by number. The requested
number is compared to the actual memory module number
(preferably hardwired in binary coded format) and
a true comparison initiates an “acknowledge” bit to be
sent back to the requesting Processor 29 and to lock up
the Connection Network 15 path therebetween.
As will be detailed hereinafter, following the strobe,
and accompanying the address field, will be any one of
four different commands, namely:
(1) STOREM. Data will follow the address; keep up
the acknowledge until the last character of data has
arrived. The timing is fixed; the data item will be
just one word long.
(2) LOADEM. Access memory at the address given,
sending the data back through the Connection Network
15, meanwhile keeping the “acknowledge” bit up until
the last 11 bit frame has been sent.
(3) LOCKEM. Same as LOADEM except that following
the access of data, a ONE will be written into
the least significant bit of the word. If the bit was ZERO,
the pertinent check bits must also be complemented to
keep the checking code correct. The old copy is sent
back over the Connection Network 15.
(4) FETCHEM. Same as LOADEM except that the
“acknowledge” is dropped as soon as possible. The
Coordinator 21 has sent this code to imply that it will
switch the Connection Network 15 to broadcast mode for the accessed
data. The data is then sent into the Connection Network 15 which
has been set to broadcast mode by the Coordinator 21 and will go to
all processors 29.
Connection Network
The Connection Network 15 has two modes of operation. In a special
purpose mode detailed hereinafter the Coordinator 21 may use the
Connection Network 15 to perform special tasks. In the special
mode, a typical operation for the Connection Network 15 is the
“Broadcast” operation wherein under command from the Coordinator 21
a word of data is “broadcast” to all processors 29 from either the
Coordinator 21 or a selected particular Extended Memory Module 13.
In the normal mode of operation a “request strobe” establishes a
two-way connection between the requesting processor 29 and the
requested Extended Memory Module 13. The establishment of the
connection is acknowledged by the requested Extended Memory Module
13. The “acknowledge” is transmitted to the requester. The release
of the connection is initiated by the Extended Memory Module 13.
Only one request arrives at a time to a given Extended Memory
Module 13. The Connection Network 15, not the Extended Memory
Module 13, resolves conflicting responses.
With reference to FIGS. 1 and 6, the Connection Network 15 appears
to be a dial-up network with up to 512 callers, the processors 29,
possibly dialing at once. There are 512 processor ports (only 16
shown, PP0-PP15), 521 Extended Memory Ports (only 16 shown,
EMP0-EMP15), and two Coordinator ports (see FIG. 4), one the
Connection Network Port 53 functioning as an Extended Memory Port
and the other, the Connection Network Buffer 23 functioning as a
processor port.
With reference now to FIG. 6, it can be seen that the Connection
Network 15 is a standard Omega Network comprised of a plurality of
switching elements 83 wherein each switching element 83 is in
essence a two-by-two crossbar network.
Addressing is provided by the requester and is decoded one bit at a
time on the fly by the Connection Network 15. Consider, for
example, that processor port PP10 desires communication with
Extended Memory Port EMP11. The processor port PP10 transmits the
Extended Memory Port EMP11 number in binary form (1011). Each
switching element 83 encountered examines one bit in order from the
most significant bit to the least. Thus switch element 83a examines
a binary one and therefore outputs on its lower (reference FIG. 6)
line 85. Switch element 83b examines a binary zero and therefore
outputs on its top line 87. Switch elements 83c and 83d both
examine binary ones leading to a final output to EMP11. For EMP11
to communicate back to PP10 a binary representation of ten (1010)
is transmitted thereby causing in the above described manner
communications to be established through switch elements 83d, c, b,
and a to PP10.
For a special or “broadcast” mode of operation it can be seen from
FIG. 6 that if all switch elements 83 were to establish dual
communication paths therethrough communication could be established
between any one Extended Memory Port EMP0 through EMP15 to all
processor ports PP0 through PP15. Likewise communication can be
broadcast from any one of the processor ports PP0 through PP15 to
all of the Extended Memory Ports EMP0 through EMP15.
Although only 16 processor ports (PP0-PP15) and 16 Extended Memory
Ports (EMP0-EMP15) are shown in FIG. 5, the Omega type connection
network 15 is readily expandable to handle any number of processor
and memory ports.
Each switch element 83 has an upper or first processor side port
84, a lower or second processor side port 86, an upper or first
extended memory side port 88 and a lower or second extended memory
side port 90. Further, each switch element 83 includes a plurality
of AND logic gates 89, a plurality of OR logic gates 91, and a
control logic circuit 93, see FIG. 7. The control logic examines
one bit of the data flowing to the switch element 83 to control the
passage of data therethrough in accord with the above-described
operation.
The control logic circuit 93 generates control signals E1, E2, E3
and E4 to control the flow of data through the switch element 83.
The control logic circuit is fed by two bits each from the upper
and lower processor ports 84 and 86. The two bits from the upper
processor port 84 are inputted on line 92. One of the bits is a
strobe signal indicating that an addressing request is passing
through the switch element 83 and the other bit indicates whether
the request is to exit through upper port 88 or lower port 90. As
will be detailed, the control logic circuit 93 recognizes the
strobe bit and honors the exit request if the requested exit port
is free. The control logic circuit 93 will also keep the requested
path through the switch element 83 open or locked long enough for
an “acknowledge” signal to return from the Extended Memory Module
13 indicating a successful path connection through the entire
connection network 15. The “acknowledge” signal will keep the path
connection locked for a time sufficient to pass the desired data
therethrough. If no “acknowledge” signal is returned within a time
sufficient for the request to travel through the connection network
15 and the “acknowledge” signal echoed back, then the control logic
circuit 93 will release the requested path through the switch
element 83.
The control logic circuit 93 receives the two bits above-described
from input line 92 and a similar two bits from the lower processor
port are inputted on line 94. The “acknowledge” bit arriving
through the upper Extended Memory Port 88 is inputted on line 96
while the “acknowledge” bit arriving through the lower Extended
Memory Port 90 is inputted through line 98. Two commands from the
Coordinator 21 are received on line 100. Although the control logic
circuit 93 is shown in more detail in FIG. 8, it is appreciated
that many alternative embodiments could be fashioned to fulfil the
function of the control logic circuit 93 as above-described. The
control logic circuit 93 includes in the FIG. 8 embodiment thereof
four identical input AND gates 102a through 102d for summing the
above-described strobe and exit port request bits. Two inverter
gates 104a and 104b are provided to complement the exit port
request bits. Four strobe circuits 106a through 106d are provided.
Each strobe circuit 106 when triggered remains “ON” for a period of
time sufficient for an “acknowledge” signal to arrive back if a
successful path is completed through the entire connection network
15. Each strobe circuit 106 feeds through an OR gate 108 shown
individually as OR gates 108a through 108d to produce an energizing
signal E identified individually as E1 through E4 to open and hold
open the request path through the switching element 83 (see
FIG. 7).
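The bit-at-a-time address decoding described for PP10 and EMP11 can be traced with a short Python sketch. It is only an illustration under stated assumptions: it assumes the textbook Omega wiring (a perfect shuffle ahead of each stage of two-by-two switches) and the 16-port network drawn in the figures, and the function name `route` and the "upper"/"lower" labels are hypothetical rather than taken from the patent.

```python
def route(source, dest, stages=4):
    """Trace a request through an Omega network of 2**stages ports.

    Assumes a perfect shuffle before every stage (textbook Omega wiring).
    Returns the exit side chosen at each stage and the final output port.
    """
    size = 1 << stages
    position = source
    exits = []
    for stage in range(stages):
        # Perfect shuffle: rotate the port number left by one bit.
        position = ((position << 1) | (position >> (stages - 1))) & (size - 1)
        # Each switch examines one destination bit, most significant first.
        bit = (dest >> (stages - 1 - stage)) & 1
        # A ONE selects the lower exit, a ZERO the upper exit.
        exits.append("lower" if bit else "upper")
        position = (position & ~1) | bit
    return exits, position


exits, port = route(10, 11)
print(exits, port)  # ['lower', 'upper', 'lower', 'lower'] 11
```

Under these assumptions the request from port 10 carrying the tag 1011 leaves the four stages on the lower, upper, lower and lower exits and arrives at port 11, matching the sequence given for switch elements 83a through 83d.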
Single “acknowledge” bits are sent back on lines 96 and 98 and are
combined through AND gates 110a through 110d with the outputs of
the strobe circuits 106a through 106d as shown in FIG. 8 to
initiate a latch circuit 112, shown individually as latch circuits
112a through 112d, to “latch” or keep locked the requested path
through the switching element 83 for a period sufficient to pass at
least an entire data word of 55 bits therethrough, 11 bits at a
time. It is realized that the strobe 106 and latch 112 circuits may
be fashioned as monostable multivibrators or other delay or timing
devices depending on the length of time (i.e., how many
nanoseconds) the strobe or latch is required to be “ON”. The length
of time required is dependent upon the type of circuit elements
chosen, the size and the clocking speed of the connection network
15.
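The strobe and latch behavior of a switching element can be modeled as a small state machine. The sketch below is a toy software model, not the FIG. 8 circuit: time is counted in abstract ticks rather than nanoseconds, the class and method names are hypothetical, and the strobe window of 8 ticks is an arbitrary assumption. Only the five-frame latch period follows from the text (a 55-bit word sent 11 bits at a time).

```python
class SwitchElement:
    """Toy model of path control in one two-by-two switch element."""

    def __init__(self, strobe_window, latch_window):
        self.strobe_window = strobe_window  # ticks to wait for an acknowledge
        self.latch_window = latch_window    # ticks to hold an acknowledged path
        self.owner = {"upper": None, "lower": None}  # input holding each exit
        self.expires = {"upper": 0, "lower": 0}      # tick at which a hold lapses

    def tick(self, now):
        """Release any exit whose strobe or latch period has run out."""
        for exit_port in self.owner:
            if self.owner[exit_port] is not None and now >= self.expires[exit_port]:
                self.owner[exit_port] = None

    def request(self, input_port, exit_bit, now):
        """Honor a strobed request only if the requested exit is free."""
        self.tick(now)
        exit_port = "lower" if exit_bit else "upper"
        if self.owner[exit_port] is not None:
            return False  # exit busy: the request is not honored
        self.owner[exit_port] = input_port
        self.expires[exit_port] = now + self.strobe_window
        return True

    def acknowledge(self, exit_port, now):
        """Latch the path if the acknowledge arrives while the strobe is ON."""
        self.tick(now)
        if self.owner[exit_port] is None:
            return False  # too late: the path was already released
        self.expires[exit_port] = now + self.latch_window
        return True


FRAMES_PER_WORD = 55 // 11  # a 55-bit word moves as five 11-bit frames
switch = SwitchElement(strobe_window=8, latch_window=FRAMES_PER_WORD)
print(switch.request("upper_in", exit_bit=1, now=0))  # True: lower exit was free
print(switch.request("lower_in", exit_bit=1, now=1))  # False: lower exit is held
```

A request that is never acknowledged simply times out, which mirrors the release of the path when no “acknowledge” is echoed back in time.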
In an alternate or special mode of operation a two bit signal is
received from the Coordinator 21 on line 100 and processed through
exclusive OR circuit 114 and OR gates 108a through 108d to open all
paths through the switching element 83 to provide for a “broadcast”
mode wherein any one particular processor 29 may “talk to” all of
the memory modules in the Extended Memory Module Array 13 and
wherein any one memory module in the Extended Memory Module Array
13 may load each processor 29 in the Processor Array 11.
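That opening every path reaches every port can be checked with a few lines of Python. This is a minimal sketch assuming the textbook Omega wiring (a perfect shuffle before each stage) and a 16-port network; the function name `broadcast_reach` is hypothetical, and the model traces data in one direction only, with the mirror-image argument covering the opposite direction.

```python
def broadcast_reach(source, stages=4):
    """Ports reached from one input when every switch opens both exits.

    Assumes textbook Omega wiring: a perfect shuffle before each stage.
    """
    size = 1 << stages
    reached = {source}
    for _ in range(stages):
        next_reached = set()
        for position in reached:
            shuffled = ((position << 1) | (position >> (stages - 1))) & (size - 1)
            # With all paths open the data leaves on both exits of the switch.
            next_reached.add(shuffled & ~1)
            next_reached.add(shuffled | 1)
        reached = next_reached
    return reached


print(sorted(broadcast_reach(3)) == list(range(16)))  # True
```

Each stage doubles the number of ports holding the data, so four stages cover all 16 ports of the model.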
Data Base Memory 
Referring again to FIG. 1, the Data Base Memory 17 
is the window in the computational envelope approach 
of the present invention. All jobs to be run on the pres- 
ent invention are staged into the Data Base Memory 17. 
All output from the present invention is staged back 
through the Data Base Memory 17. Additionally, the 
Data Base Memory 17 is used as back-up storage for the 
Extended Memory Module 13 for those problems
whose data base is larger than the storage capacity of 
the Extended Memory Module 13. Control of the Data 
Base Memory 17 is from the Data Base Memory Con- 
troller 19 which accepts commands both from the Co- 
ordinator 21 for transfers between the Data Base Mem- 
ory 17 and the Extended Memory 13, and from the 
Support Processor System (not shown) for transfers 
between the Data Base Memory 17 and the File Mem- 
ory (not shown). 
In the preferred embodiment of the invention, see FIG. 9, a general
CCD (charge-coupled device) array 101 is used as the primary
storage area, and two data block size buffer memories 103 and 105
of 64K word capacity each are used for interfacing to the secondary
storage file memory. Experience in large data array systems and
scientific array processors indicates that about 99% of the traffic
between the Data Base Memory 17 and the file memory is generally
simple large data block transfers of program and data. To provide
for high volume-high speed transfers, four data channels 107, 109,
111 and 113 are provided.
The buffer memories 103 and 105 are connected to the CCD array 101
through a data bus 115, preferably 55 bits wide, and a data
register 117 of 440 bits width. The data bus 115 feeds directly to
the Extended Memory Modules 21 with no additional buffering
required except for the one-word (55 bit) I/O Buffer 51 (see FIG.
4) provided with each Extended Memory Module 21.
Data Base Memory Controller 
The Data Base Memory Controller 19 interfaces two environments: the
present invention internal environment and the file memory
environment, since the Data Base Memory 17 is the window in the
computational envelope. The Data Base Memory 17 allocation is under
the control of the file memory function of the Support Processor.
The Data Base Memory Controller 19 has a table of that allocation,
which allows the Data Base Memory Controller 19 to convert names of
files into Data Base Memory 17 addresses. When the file has been
opened by a present invention program it is programmed as far as
allocation is concerned, and remains resident in Data Base Memory
17 until it is either closed or abandoned. For open files, the Data
Base Memory Controller 19 accepts descriptors from the Coordinator
21 which call for transfers between Data Base Memory 17 and
Extended Memory Modules 13. These descriptors contain absolute
Extended Memory Module 13 addresses but actual file names and
record numbers for the Data Base Memory 17 contents.
Operation is as follows. When a task for the present invention has
been requested, the Support Processor passes the names of the files
needed to start that task. In some cases existing files are copied
into newly named files for the task. When all files have been moved
into the Data Base Memory 17, the task starts in the present
invention. When the task in the present invention opens any of
these files, the allocation will be frozen within the Data Base
Memory 17. It is expected that “typical” task execution will start
by opening all necessary files. During the running of a present
invention task, other file operations may be requested by the user
program on the present invention, such as creating new files and
closing files.
Extended Memory Module 13 space is allocated either at compile time
or dynamically during the run. In either case, Extended Memory
Module 13 addresses are known to the user program. Data Base Memory
17 space, on the other hand, is allocated by a file manager, which
gives a map of Data Base Memory 17 space to the Data Base Memory
Controller 19. In asking the Data Base Memory Controller 19 to pass
a certain amount of data from Data Base Memory 17 to Extended
Memory Module 13, the Coordinator 21, as part of the user program,
issues a descriptor to the Data Base Memory Controller 19 which
contains the name of the Data Base Memory 17 area, the absolute
address of the Extended Memory Module 13 area, and the size. The
Data Base Memory Controller 19 changes the name to an address in
Data Base Memory 17. If that name does not correspond to an address
in Data Base Memory 17, an interrupt goes back to the Coordinator
21, together with a result descriptor describing the status of the
failed attempt.
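The name-to-address translation just described can be sketched as a table lookup. The following Python is a minimal illustration only: the class names, field names, the exception standing in for the interrupt, and the layout of the result descriptor are all hypothetical, since the text does not specify them.

```python
from dataclasses import dataclass


@dataclass
class Descriptor:
    file_name: str   # name of the Data Base Memory area
    em_address: int  # absolute Extended Memory Module address
    size: int        # amount of data to move, in words


class NameNotAllocated(Exception):
    """Stands in for the interrupt sent back to the Coordinator."""

    def __init__(self, result_descriptor):
        super().__init__(result_descriptor["status"])
        self.result_descriptor = result_descriptor


class DataBaseMemoryController:
    def __init__(self, allocation_map):
        # Map supplied by the file manager: file name -> Data Base Memory address.
        self.allocation_map = dict(allocation_map)

    def resolve(self, descriptor):
        """Turn a descriptor's file name into a Data Base Memory address."""
        try:
            dbm_address = self.allocation_map[descriptor.file_name]
        except KeyError:
            # Unknown name: report the failure with a result descriptor.
            raise NameNotAllocated(
                {"status": "name not found", "file_name": descriptor.file_name}
            ) from None
        return dbm_address, descriptor.em_address, descriptor.size


controller = DataBaseMemoryController({"mesh.dat": 4096})
print(controller.resolve(Descriptor("mesh.dat", 0x200, 128)))
```

The user program supplies only the file name and the Extended Memory address; the Data Base Memory address stays private to the controller and the file manager, which is what lets the file manager own the allocation.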
Not all files will wait to the end of a present invention run to be
unloaded. For example, the number of snapshot dumps required may be
data dependent, so it may be preferable to create a new file for
each one and to close the file containing a snapshot dump so that
the File Manager can unload it from Data Base Memory 17. When the
present invention task terminates normally, all files that should
be saved are closed.
Although the present invention has been described with reference to
its preferred embodiment it is understood by those skilled in the
art that many modifications, variations, and additions may be made
to the description thereof. For example, the number of processors
or memory modules in the arrays thereof may be increased as
specific processing and storage requirements may dictate. Also,
although the Connection Network is described as an Omega network it
is clear that any network having local mode control and the ability
to decode path direction bits or flags on the fly may be used.
Further, the Omega network may be doubled in size so as to minimize
the effect of a single blocked path. Routine mapping algorithms may
interpose actual memory module destinations and memory module port
designations if desired. Additional gating may be provided in each
switching element of the Connection Network to allow for a
“wrap-around” path whereby processors may communicate with each
other as well as with memory modules. The control from the
Coordinator may be expanded so that there can be two separate
broadcast modes; one to the processors, and the reverse to the
memory modules.
Further, although the present invention has been 
described with a crossbar network having each switch- 
ing element fashioned to examine all incoming data for 
a “strobe” or addressing bit, it is appreciated that once 
a “strobe” bit is detected and an “acknowledge” bit 
returned that logic could be provided to free up the bit 
position of the “strobe” bit for other purposes during 
the period when an acknowledged latch was present 
and data was being transferred through the switching 
element from a processor to a memory module. For 
example, the freed-up strobe bit position could be used 
for a parity bit for the data being transferred. Other like 
changes and modifications can also be envisioned by 
those skilled in the art without departing from the sense 
and scope of the present invention. 
Thus, while the present invention has been described 
with a certain degree of particularity, it should be un- 
derstood that the present disclosure has been made by 
way of example and that changes in the combination 
and arrangement of parts obvious to one skilled in the 
art, may be resorted to without departing from the 
scope and spirit of the invention. 
What is claimed is: 
1. An array processing system interfacing with a host 
processor for performing I/O functions and a file mem- 
ory, said array processing system comprising: 
data base buffer memory means having at least one 
data transfer channel connected to the file memory 
under control of the support processor for rapid 
transfer of information therethrough, said data base 
memory means receiving parallel processing jobs 
from the file memory and returning to the file 
memory the processed resultants of the jobs re- 
ceived, said parallel processing jobs including both 
programs for vector and scalar processing and 
related data; 
an array of parallel memory modules connected via a 
second data transfer channel to said data base mem- 
ory means, each memory module in said array 
storing data from said data base memory means 
during processing by said array processing system; 
an array of parallel processor means, each processor 
means being programmable for independently pro- 
cessing data fetched approximately in parallel from 
individual memory modules, each processor means 
having indicator means for indicating when a fu- 
ture operation must be initiated in parallel with at 
least one other processor means for indicating 
when it is ready to begin said future operation; 
connection network means interposing said array of 
parallel memory modules and said array of parallel 
processor means, said connection network means 
providing a plurality of data communication paths 
between said array of parallel memory modules 
and said array of parallel processor means wherein 
each provided data communications path in said 
plurality thereof is provided to a requested individ- 
ual memory module from a requesting individual 
processor means; and
coordinator means connected to each processor 
means for monitoring each said indicator means for 
indicating when a future operation must be initi- 
ated in parallel with at least one other processor 
means and for inhibiting each processor means so 
indicating from initiating said future operation until 
each processor means so indicating is also indicating
that it is ready to begin its future operation.
2. The array processing system according to claim 1
wherein:
said array of parallel memory modules comprises an
array of a prime number of identical memory modules; and
said array of parallel processor means comprises an
array of a power-of-two number of identical processors,
said power-of-two number being the greatest
power-of-two number less in value than said
prime number.
3. The array processing system according to claim 2
wherein said prime number is 521 and said power-of-two
number is 512.
4. The array processing system according to claim 1
wherein said data base memory means includes:
a solid-state information storage unit; and
a data base memory controller means for providing
buffering and interfacing between said solid-state
information storage unit and the file memory and
between said solid-state information storage unit
and said array of parallel memory modules.
5. The array processing system according to claim 1
wherein each processor in said array of parallel processor
means further includes:
an execution unit;
a processor memory for storing instructions for execution
by said execution unit; and
a connection network buffer for interfacing between
said connection network means and said execution
unit.
6. The array processing system according to claim 1
wherein each memory module in said array thereof
includes a solid-state random access memory unit sufficient
in size to store 64,000 words of 55 bits each.
7. The array processing system according to claim 1
wherein:
said connection network further includes means for
providing simultaneously data communication
paths to all processor means in said array thereof
for broadcasting thereto from any memory module
in said array thereof.
8. The array processing system according to claim 1
or claim 7 wherein:
said connection network further includes means for
providing simultaneously data communication
paths to all memory modules in said array thereof
for broadcasting thereto from any processor means
in said array thereof.
* * * * *
