Massively parallel processor computer by Fung, L. W.
United States Patent 4,380,046
Fung [45] Apr. 12, 1983
[54] MASSIVELY PARALLEL PROCESSOR COMPUTER
[76] Inventor: Robert A. Frosch, Administrator of
the National Aeronautics and Space
Administration, with respect to an
invention of Lai-Wo Fung,
Morristown, N.J.
[21] Appl. No.: 41,143 
[22] Filed: May 21, 1979 
[51] Int. Cl.3 ..................... G06F 15/16; G06F 15/347
[52] U.S. Cl. .................................................... 364/200
[58] Field of Search ... 364/200 MS File, 900 MS File;
235/92 SH
Primary Examiner: James D. Thomas
Assistant Examiner: David Y. Eng
Attorney, Agent, or Firm: Ronald F. Sandler; John R.
Manning; John O. Tresansky
[57] ABSTRACT
An apparatus for processing multidimensional data with 
strong spatial characteristics, such as raw image data, 
characterized by a large number of parallel data streams 
in an ordered array, comprises a large number (e.g. 
16,384 in a 128 × 128 array) of parallel processing elements
operating simultaneously and independently on
single bit slices of a corresponding array of incoming 
data streams under control of a single set of instructions. 
Each of the processing elements comprises a bidirec- 
tional data bus in communication with a register for 
storing single bit slices together with a random access 
memory unit and associated circuitry, including a binary
counter/shift register device, for performing logical
and arithmetical computations on the bit slices, and
an I/O unit for interfacing the bidirectional data bus 
with the data stream source. The massively parallel 
processor architecture enables very high speed process- 
ing of large amounts of ordered, parallel data, including 
spatial translation by shifting or “sliding” of bits verti- 
cally or horizontally to neighboring processing ele- 
ments. 
14 Claims, 15 Drawing Figures 
https://ntrs.nasa.gov/search.jsp?R=19830017107 2020-03-22T01:01:40+00:00Z
[Drawing Sheets 1 through 6 of U.S. Patent 4,380,046 (FIGS. 1-12) are not reproduced in this text.]
MASSIVELY PARALLEL PROCESSOR COMPUTER

ORIGIN OF THE INVENTION

The invention described herein was made in the performance
of work under a NASA contract and is subject to the
provisions of Section 305 of the National Aeronautics and
Space Act of 1958, Public Law 85-568 (72 Stat. 435; 42
U.S.C. 2457).

BACKGROUND OF THE INVENTION

The present invention relates generally to multidimensional
data processing computers, and more particularly, toward a
single instruction, multiple data stream computer,
comprising a large number of individual processing elements
operating in parallel on multiple data streams in single bit
slices, simultaneously and in an identical manner, in
response to a single set of instructions stored in a
processor array control unit. The massively parallel
processor architecture has particular utility to real time
processing of image data generated by an image sensor array
as a large number of parallel data streams, each
corresponding to a picture element (pixel). The architecture
is also useful for processing any other ordered,
multidimensional array of parallel data.

Conventional digital computers are composed of devices that
are programmed to perform logical operations on one
dimensional binary signals. These computers, although they
can be adapted to process multidimensional binary signals,
are inefficient and slow for that purpose because the
multidimensional data must be converted to a single, serial
data stream suitable for conventional single dimensional
signal processing.

There have been increasing applications in image processing
and other spatially oriented computations where, for
example, raw, multidimensional data transmitted from
satellite based sensors to ground must undergo signal
processing such as distortion correction and classification.
Thus, there have been increasing requirements for
multidimensional data processing computers that are fast
enough to operate in real time on data of two or more
dimensions (such as two dimensional imaging data) and
compact enough to be carried on board in satellites,
missiles or spacecraft.

In response, various types of multidimensional data
processors for applications such as image processing have
been developed. The prior art includes a two dimensional
digital computer that operates on parallel optical signals
arranged in an ordered array, including several different
types of optical elements to provide direct image
processing, such as sliding and interleaving. One embodiment
of the computer operates in the optical domain using fiber
optics and may be adapted to process electrical binary
signals under program control. Also, operations on data are
basically logical manipulations and complex operations such
as arithmetic computations are executed by multiple-step
programs.

Other approaches taken, wherein electrical image signals are
processed for arithmetic as well as logical operations, have
been too complex for on-board utilization. For example,
“giant” computers, such as the ILLIAC IV, have been utilized
wherein a number of data streams are processed in a smaller
number of parallel processors. A substantial portion of the
computation time must be devoted, however, to data
partitioning and routing, and this approach is, therefore,
impractical.

SUMMARY OF THE INVENTION

An object of the present invention, therefore, is to provide
a multidimensional data processing computer that
simultaneously processes a large number of parallel
electrical signals to enable high speed processing of
parallel data arrays.

Another object is to provide a new and improved
multidimensional data processing computer that operates
simultaneously on a large number of data streams in parallel
for processing two dimensional imaging data in real time.

A further object of the invention is to provide a new and
improved multidimensional data processing computer composed
of an array of parallel, identical processing elements that
operate individually on parallel data streams in a
multidimensional data array in response to a single set of
instructions.

Yet another object is to provide a new and improved
multidimensional data processing computer having a large
number of identical processing elements operating in
parallel in response to a single set of instructions to
process an array of data streams in single bit data slices.

Still another object is to provide a new and improved
multidimensional data processing computer having an array of
identical processing elements operating in parallel in
response to a single set of instructions stored in a
processor array control unit, wherein the elements operate
individually on a large number of incoming data streams in
single bit slices defining an image plane, wherein the bits
are logically and arithmetically processed as well as
shifted among processor elements under single program
control.

Yet another object is to provide a new and improved,
multidimensional data processing computer of the type
described above, that is relatively simple and compact, and
thus adapted for onboard utilization in spacecraft,
satellites and the like.

Still another object is to provide a new and improved
processing element which is both simple and compact yet
retains high speed and flexible capabilities.

In accordance with the invention, a single instruction,
multiple data stream computer comprises an N × M (most
often, M=N) array of processing elements, individually and
simultaneously operating on an N × M array of parallel
streams of data under control of a single instruction set
stored in a processing element array control unit. Data flow
between the processing element array and control unit as
well as with respect to peripheral devices is managed by a
program and data management unit that is a general purpose
mini computer having N bit input and N bit output data
registers. The program and data management unit also loads
programs into the processor array control unit for
execution, supplies data to the processing elements,
displays results and handles housekeeping such as
diagnostics and interfacing.

The array of processing elements is of particular importance
to the invention. Each processing element is formed of three
basic components, an arithmetic, logic and routing unit
(ALRU), an I/O unit and a local memory unit, all
interconnected on a bidirectional data bus. The ALRU
contains three subunits, a binary counter/shift-register
subunit, a logic-slider subunit and a mask subunit. The
logic-slider subunit contains a one bit storage register
(P-register). This subunit executes logical operations as
well as slides bits to “nearest neighbor” processing
elements in the array.
The binary counter/shift-register subunit contains a series
of registers (C-register). This subunit is operative
selectively as a counter or shift register in response to
command signals supplied by the array control unit. In the
counter mode, the contents of the subunit are incremented by
the instantaneous logic state of the bidirectional data bus.
In the shift register mode, the contents are downshifted by
one stage, emptying the predownshift value of the lowest
stage of the register to the data bus. While a closed ring
configuration is disclosed, it should be emphasized that a
conventional counter/shift-register could be employed.

The mask subunit contains a one bit register (G-register).
This subunit selectively inhibits both the P-register and
counter/shift register in response to the array control
unit. In a masked mode, an instruction generated by the
array control unit will be executed in only those processing
elements having their G-registers in a logical one state
whereas in an unmasked mode, execution of instructions by
the processing elements is not affected by the state of
their corresponding G-registers.

The I/O (sub) unit serves as a storage element for input and
output operations. The instantaneous logical state of the
bidirectional data bus can be stored into the I/O unit in a
one bit register (S-register), and similarly, the logical
state of the S-register can be read out to the data bus. The
I/O unit is capable of shifting bits to the I/O unit in
neighboring processing elements. As disclosed, the bits are
shifted only in a single direction (from left to right).
Thus, in a 128 × 128 processing element array, a 128 × 128
member, one bit slice data stream array will require 128
shifting operations to move the data array into the
processing element array. Another 128 shifting operations
are required to move the data out of the processing element
array. The data bits may also, as aforementioned, be moved
directly between P-registers in a “nearest neighbor” fashion
in a procedure termed “sliding.” Sliding enables an
instantaneous one-bit slice of an image to be translated
vertically or horizontally in the image plane.

The single instruction characteristic of the present
architecture causes a common bit slice of all data streams
to be operated upon simultaneously without additional
software.

The local memory unit is a multiple bit, random access
memory (RAM), for storing the logical state of the data bus
at a memory location addressed by the array control unit.
Again, because the processing element array is controlled by
a single set of instructions in the control unit, identical
memory locations in all RAMs are simultaneously addressed
for reading or writing.

Data communication among the logic-slider subunit,
counter/shift register and mask subunits of each ALRU as
well as the corresponding I/O unit and the RAM on the
bidirectional data bus enables processing of single bit
slices of the parallel stream data array under program
control for diverse applications such as cross correlation,
distortion correction and identification. Sliding of data in
the processor array is executed independently of other
processing element operations so that data input and output
can take place simultaneously with array computations.

Still other objects and advantages of the present invention
will become readily apparent to those skilled in this art
from the following detailed description, wherein there is
shown and described only the preferred embodiments of the
invention, simply by way of illustration of the best modes
contemplated of carrying out the invention. As will be
realized, the invention is capable of other and different
embodiments, and its several details are capable of
modifications in various obvious respects, all without
departing from the invention. Accordingly, the drawings and
description are to be regarded as illustrative in nature,
and not as restrictive.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram showing the primary components of
a massively parallel processing computer, in accordance with
the invention;
FIG. 2 is a circuit diagram showing the basic structure of
each processing element in the ARU shown in FIG. 1;
FIG. 3 is a data flow diagram showing the left to right
shifting characteristic of the S-registers in the I/O unit
shown in FIG. 2;
FIG. 4 is a data flow diagram showing “nearest neighbor”
routing of data among the processing elements through
corresponding logic/slider subunits in the ARU;
FIG. 5 is a schematic diagram of the logic-slider subunit
for controlling data flow among neighboring processing
elements in the array;
FIG. 6A is a block diagram showing a preferred embodiment of
the binary counter/shift register (BC/SR) subunit shown in
FIG. 2;
FIG. 6B is a circuit diagram showing a stage of the BC/SR
shown in FIG. 6A;
FIG. 6C is a circuit diagram of a rotating pointer used in
the BC/SR;
FIG. 6D is a circuit diagram of a downshift buffer storage
and controller for the BC/SR;
FIG. 7 is a schematic diagram of the mask subunit shown in
FIG. 2;
FIG. 8 is a schematic diagram of an I/O unit shown in
FIG. 2;
FIG. 9 is a diagram showing flow of data and control signals
with respect to the local memory unit of FIG. 2;
FIG. 10 is a signal timing diagram for operating the
processing elements;
FIG. 11 is a circuit diagram of a processing element command
and control signal distributor; and
FIG. 12 is a signal timing diagram for operating the
distributor shown in FIG. 11.

DESCRIPTION OF THE PREFERRED EMBODIMENT

Referring to FIG. 1, a massively parallel processor computer
20, in accordance with the invention, comprises as its basic
component a processing element array unit (ARU) 22 which
functions as a single instruction, multiple data stream
computer formed of an N × N array of processing elements 44
(FIG. 2), to be described below, operating under common
control by an array control unit (ACU) 24.

The ACU 24 provides ARU 22 with instructions for execution
at a predetermined clock rate under control of a master
clock (not shown) and includes instruction looping and
subroutine handling capability. Data flow is managed among
ARU 22 and ACU 24 and peripheral devices, such as a CRT
display 28, a tape recorder 30, disc memory 32 and printer
34, by a program and data management unit (PDMU) 26. PDMU 26
loads programs into ACU 24 for execution along line 27 and
also provides input data along line 29 to the ARU 22.
PDMU 26 further displays results and controls data
housekeeping functions, such as test and diagnostic routines
to both ACU 24 and ARU 22, along lines 27, 29 and manages
all data flow and interfacing. PDMU 26, which is a general
purpose mini computer, such as a PDP-11, manufactured by
Digital Equipment Corporation, is provided with N-bit input
and output data registers (not shown) for communication
through data path 29 with the ARU 22 having the N × N square
array architecture, under control of ACU 24.

System 20 is interfaced with a program interface unit 36 and
an interface module 38 (such as a DR-70 interface module 80,
manufactured by Digital Equipment Corporation) to a host
computer 40 (such as PDP 11/70) programmed for operation,
for example, as an atmospheric and oceanographic information
processing system (AOIPS) for supplying imaging data to the
system 20. Communication of programming data between the
computer 40 and PDMU 26 through the conventional interface
unit 36 enables the host computer 40 to request operation of
system 20 directly for processing imaging data. Additional
interface units 41 and 42 enable, respectively, direct
control of array unit 22 by external control signals as an
alternative to control by ACU 24 and accessing of data
flowing between the ARU 22 and PDMU 26.

In accordance with the data processing strategy of the
present invention, a large number (N², where N is an integer
on the order of at least 128) of streams of data in an
(N × N) array having strong spatial characteristics, such as
raw imaging data, generated by an AOIPS computer, are
simultaneously processed in parallel within the individual
processing elements 44 constituting ARU 22. In a two
dimensional system for processing imaging data, for example,
the (N × N) data streams are supplied to ARU 22 where, under
control of ACU 24, they are simultaneously processed in
elements 44 under a single set of instructions, in single
bit data slices constituting binary image planes. All of the
processing elements 44 of the array 22 are identical to each
other, i.e., are constituted by identical electrical
components. Thus, a considerable number of the processing
elements 44 (approximately four, using present technology)
can be fabricated on a single LSI chip. As will become clear
from the following, the data in each image plane can be
modified, under the program control, to undergo arithmetic
as well as logical processing and can be translated as a
single block in vertical or horizontal directions in a
process known as “sliding.”

Referring to FIG. 2, the basic structure of each processing
element 44 in ARU 22 includes an arithmetic logic and
routing unit (ALRU) 46, an input and output (I/O) unit 48
and a local memory unit (LMU) 50 in the form of a
single-bit, random access memory (RAM), all interconnected
on a bidirectional data bus 52 which transfers data on a
single-bit basis. Multiple-bit logic and arithmetic
operations are performed with special algorithms which are
based on bit-serial data transfers along data bus 52, and on
bit-wise functions provided by the ALRU 46.

ALRU 46 comprises three functional components, a binary
counter/shift register (BC/SR) subunit 54, a logic-slider
subunit 56 including a single bit register (P-register) and
a mask subunit 58 including a second single bit register
(G-register). The BC/SR 54, logic-slider subunit 56 and mask
subunit 58 are connected together within ALRU 46 along data
and control lines 52, 60, 62, 64 and 66.

The I/O unit 48 shall be described in detail below in
connection with FIG. 8. For the present, it is sufficient to
say that the I/O unit 48 has a single bit storage S-register
48a which serves as a storage element for input and output
of data with respect to the processing element 44. The
instantaneous logical state of data bus 52 can be stored,
under control of ACU 24, in the S-register 48a of I/O unit
48 and conversely the logical state of the I/O unit 48 can
be read out through the data bus 52. The S-register 48a in
I/O unit 48 of any processing element 44 in ARU 22 can also
receive an input bit from the S-register of the processing
element to its left, thus achieving the transfer of the
contents of all S-registers to the S-registers of the
processing elements to their right. This latter mode is used
for inputting and outputting data with respect to the ARU
22, as illustrated in FIG. 3.

LMU 50 contains a number of basic storage units, e.g., 256
bits of random access memory (RAM). The logical state bit of
data bus 52 can be stored into the LMU 50 at any memory bit
location addressed by ACU 24. Similarly, the bit stored at
any memory location at LMU 50 can be read out by the ACU 24.
Of particular significance, the single instruction
characteristic of the massively parallel processor
architecture of the present invention causes identical
addressing of all LMUs in the ARU 22 for reading or writing.
In actual implementation, commercial RAM integrated circuit
chips can be used in LMU 50 although these chips are usually
oriented towards multiple-bit-words. In this case, each bit
in the word of the RAM chips corresponds to LMU 50 for one
processing element 44; as many processing elements 44 as the
number of bits in the word will be provided with local
memory unit 50 which are all housed in one integrated
circuit chip. Further details of the structure and operation
of the LMU 50 shall be described below in connection with
FIG. 9.

Intercommunication among the N² processing elements 44
within ARU 22 is by two separate routing networks. Referring
to FIG. 3, data flow among the S-registers 48a of array 22,
as mentioned above, is only from left to right. An N × N
array of parallel data streams is loaded into ARU 22 by
entering one N-bit column of the data array via the N-bit
input port into the first (left hand side) column of
S-registers 48a of the array of I/O units 48 shown in
FIG. 3. This N-bit column of the data array comes either
from the program and data management unit 26 via data paths
29, or from external devices through the N-bit I/O
data-interface 42. The entire array of data is then
successively shifted N positions to the right until the
array of S-registers 48a contains a complete one bit image
plane stored therein. This image plane is then stored into
the LMU 50 for later processing by a transfer from the
S-register 48a into corresponding memory cells at some
memory location of the LMU 50 via the bidirectional data bus
52. Usually the raw imaging data are digitized into a number
of bits of precision, so that the above process of inputting
one bit image plane is repeated as many times as the number
of bits of precision. Following processing of the stored raw
data in the array of processing elements 44 by logical and
arithmetic operations in the ALRU 46 and LMU 50, the image
planes are transferred from LMU 50 into the S-registers, one
bit plane at a time, and then read out from the array 22 by
shifting all the stored bits N positions to the right
through the output port. One N-bit column at a time is
stored either into the PDMU 26 through data path 29, or into
external devices through the I/O data interface 42. The
PDMU 26 in FIG. 1 contains N-bit input and output data
registers for the above storage of N-bit column data.
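
The column-by-column loading through the S-registers can be illustrated with a minimal Python sketch. It is not part of the patent disclosure: the function names (shift_right, load_bit_plane, unload_bit_plane, store_plane), the use of nested lists for the S-register array and the LMUs, and the choice of zeros as filler during unloading are all assumptions made for illustration.

```python
def shift_right(s_regs, incoming):
    """One left-to-right shift of every S-register row.

    `incoming` is the N-bit column entering at the left edge; the
    N-bit column leaving at the right edge is returned.
    """
    outgoing = [row[-1] for row in s_regs]
    for row, bit in zip(s_regs, incoming):
        row[1:] = row[:-1]   # every bit moves one element to the right
        row[0] = bit         # new bit enters the left-hand column
    return outgoing


def load_bit_plane(plane):
    """Load an N x N one-bit plane using N column shifts."""
    n = len(plane)
    s_regs = [[0] * n for _ in range(n)]
    # The rightmost column enters first so that it reaches the right
    # edge of the array after N shifts.
    for c in reversed(range(n)):
        shift_right(s_regs, [plane[r][c] for r in range(n)])
    return s_regs


def unload_bit_plane(s_regs):
    """Read a plane out through the right edge, one column per shift.

    Zeros are shifted in as filler (an assumption of this sketch).
    """
    n = len(s_regs)
    return [shift_right(s_regs, [0] * n) for _ in range(n)]


def store_plane(lmu, address, s_regs):
    """Copy every S-register into its own LMU at one common address."""
    for r, row in enumerate(s_regs):
        for c, bit in enumerate(row):
            lmu[r][c][address] = bit
```

For data digitized to k bits of precision, load_bit_plane and store_plane would be called k times in this sketch, once per bit plane, each time with a different address.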
In another mode, bits are shifted vertically or hori- 
zontally directly among neighboring processing ele- 
ments 44 in a nearest neighbor fashion without passing 
through the I/O unit 48 in a process termed “sliding.” 
Referring to FIG. 4, the “nearest neighbor” routing 
incorporated by the logic-slider subunits 56 of ARU 22 
under control of ACU 24 is illustrated. Each block 
represents a processing element 44 in ARU 22 with 
element (i, j) in the center of FIG. 4 representing a 
general processing element, and the remaining eight
processing elements shown in the Figure being the 
“nearest neighbor” elements in the array. It is to be 
understood that whereas nine elements are shown in 
FIG. 4 for the purpose of illustration, an actual array 
may contain, for example, 16,384 processing elements in 
a 128 × 128 (N × N) array. Neighboring horizontal processing
elements are connected together by three separate
lines L1, L2 and L3 whereas neighboring vertical
elements are interconnected by lines L4 and L5. During 
inputting and outputting of data, in the manner de- 
scribed above with respect to FIG. 3, data bits are trans- 
ferred between processing elements through I/O units 
48 from left to right along paths L2 (I/O units are not 
shown in the path L2 for simplicity). Sliding of data up, 
down, left and right in the nearest neighbor fashion, 
however, is made directly on lines L1 and L3, respec- 
tively, for left or right direction data slides, and on lines 
L4 and L5, respectively, for up and down data slides. 
Data caused to slide beyond a processing element on the 
outer boundaries of ARU 22 are lost; feedback, how- 
ever, to opposite boundaries for “wraparound” data 
routing may optionally be provided. 
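
The sliding operation just described can be illustrated with a minimal Python sketch. It is not taken from the patent: the function name slide, the direction strings, the wraparound flag, and the choice of zero for positions that receive no bit at the boundary are assumptions made for illustration.

```python
def slide(p_regs, direction, wraparound=False):
    """Slide a one-bit plane of P-register values by one position.

    Without wraparound, bits pushed past the array boundary are lost
    and positions with no incoming neighbor receive 0 (an assumption
    of this sketch). With wraparound, bits re-enter at the opposite
    boundary.
    """
    n_rows, n_cols = len(p_regs), len(p_regs[0])
    d_row, d_col = {
        "up": (-1, 0),
        "down": (1, 0),
        "left": (0, -1),
        "right": (0, 1),
    }[direction]
    result = [[0] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            tr, tc = r + d_row, c + d_col
            if wraparound:
                tr %= n_rows
                tc %= n_cols
            elif not (0 <= tr < n_rows and 0 <= tc < n_cols):
                continue  # bit slides beyond the boundary and is lost
            result[tr][tc] = p_regs[r][c]
    return result
```

Because every element moves its bit in the same direction at once, a single call corresponds to one slide instruction applied to the whole image plane.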
An overview of the basic components of the mas- 
sively parallel processor computer 20 having been 
given above, the structure and operation of the com- 
puter shall now be described in detail with reference to 
FIGS. 5-12. For the purpose of the following discus- 
sion, the following assumptions will be made. For all 
gates and signals, logical one is represented by a high 
signal level and logical zero is represented by a low 
signal level; all tri-state output gates invert their input 
signals; all D-type flip flops are triggered by the rising 
edges of the signals presented to their clock inputs, data 
being strobed into the flip flops at these rising edges; 
and all toggle flip flops are toggled (i.e., states of the flip
flops change from 0 to 1, or from 1 to 0) at the rising
edges of their input signals. Development and detailed 
characteristics of various control signals described 
throughout the Figures shall be described in detail in 
connection with FIGS. 10-12. 
Referring first to FIG. 5, the structure and operation 
of logic-slider 56 within ALRU 46 are now discussed.
Logic-slider 56 comprises a flip flop 76 functioning as 
the basic storage register, or P-register, for storing a 
single bit data slice together with logic circuitry for 
performing logic operations and for routing the single 
bit to and from the P-registers of four neighboring pro- 
cessing elements 44 in the ARU 22 in “nearest neigh- 
bor” fashion. Flip flop 76 is in communication with the 
bidirectional data bus 52 through tri-state output gate 78 
for transferring the content of flip flop 76 onto the bus, 
and through multiplexer 80 and logic gates 82 for
transferring the logical state of the data bus to the flip
flop. A select signal supplied by ACU 24 to one input 84 of
gates 82 controls the output 85 of the gates to invert or
directly pass the instantaneous logical state of the
bidirectional data bus 52 to the multiplexer 80. The Q
output of flip flop 76 is fed back to the input of
multiplexer 80 through an AND gate 86, an exclusive OR gate
88 and an OR gate 90 to perform logical operations upon the
input bit at line 85 under control of ACU 24 through
multiplexer control lines 92. The result of the selected
logical operation is the replacement of the original content
of the P-register. The output Q of flip flop (P-register) 76
is supplied through output lines 98, to the P-registers of
the four neighboring processing elements 44 in ARU 22
through a second multiplexer 94 in each element.
The second multiplexer 94 controlled by the ACU 24 through
multiplexer control lines 97 selectively supplies an input
from any of the four nearest neighbor processing elements 44
in array 22 to flip flop 76. The two control lines 97 enable
a bit from one of the four input lines 98 to be passed
through multiplexer 94 by digital encoding.
Thus, the logic circuitry associated with flip flop 76
enables transfer of bits from any of the four nearest
neighbor processing elements to the P-register (flip flop
76) and enables any of several logical operations (gates 86,
88 and 90) to be selectively applied to the stored bit and
the selected input signal on line 85 under control of the
ACU 24. The logic circuitry also transfers the output of
flip flop 76 to all four nearest neighbor processing
elements through output line 98, selectively transferring
the output to the P-register 76 in one of these processing
elements.
Control signals supplied to input 102 of the flip flop 76
cause data from bus 52 to be stored in the flip flop from
multiplexer 80 for processing. Also, control signals
supplied from ACU 24 to control input 100 of tri-state
output gate 78 cause processed data to be read out from flip
flop 76 onto the bidirectional data bus 52. Thus, as
discussed briefly above, whereas bits on data bus 52 are
inputted and outputted with respect to the processing
elements 44 only through the I/O unit 48, bits are also
directly transferred among processing elements through
sliding via data lines 98. Logical manipulation of bits in
P-registers 76 is performed independently of the
input/output mode.
Referring to FIGS. 6A-6D, and initially to FIG. 6A, BC/SR
subunit 54 is similar in function to a ripple counter with
the additional capability of downshifting its stored
contents. The BC/SR 54 comprises eight storage registers 104
arranged in the form of a ring, with each stage 104 being
connected to a buffer storage and controller unit 112 (shown
in detail in FIG. 6D) through input port 114 and output port
116. The selection of eight storage registers is arbitrary.
The number selected in a given design depends on the
required precision which, in turn, depends on the
anticipated computations in a given application.
Communication between the bidirectional data bus 52 and
storage/controller 112 is through data port 118. Each stage
104 (shown in detail in FIG. 6B) of BC/SR 54 is adapted to
send a carry signal to the next higher stage via a carry-out
port 1#1 and is adapted to receive a carry signal from the
next lower stage via toggle-in port 112.
The lowest stage of the BC/SR subunit 54 is defined by the
position of a rotating pointer shown symbolically as
counterclockwise arrows in the center region of subunit 54
in FIG. 6A. The rotating pointer 124, shown in detail in
FIG. 6C, has a unique pointer output terminal 127 for each
one of the BC/SR stages 104. The
pointer 124 comprises eight, three-input AND gates 126 
having inputs connected to a three-line data bus 130 
upon which the output of a three-stage counter 128 is 
applied. Each of the gates 126 has a unique, internally 
wired logic to cause the outputs thereof to be succes- 
sively high in response to counter 128, the output of one 
gate being high at any instant of time. Thus, with 
counter 128 up-counting in a free-running mode, a high 
output signal on the gate 126 output is continuously 
circulated to successively address the binary BC/SR 
stages 104. In this manner, the stage of the BC/SR 54 
identified as “lowest” is continuously moved to succes- 
sive stages on the ring. The lowest stage of BC/SR 54 is 
significant because communication between BC/SR 54 
and data bus 52 is via the lowest stage. The BC/SR 54 
design shown in FIGS. 6A-6D allows a BC/SR down-
shift operation almost immediately following a BC/SR
increment operation because the downshift operation
does not physically shift the BC/SR; instead, the
rotating pointer 124 changes the position of the lowest
BC/SR stage, and so there is no need to wait for the
propagation of ripple carry signals from stage 104 to
higher stages arising from the preceding BC/SR incre-
ment operation.
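The ring-with-pointer arrangement can be sketched in software as follows. This is a behavioral toy model under stated assumptions, not the gate-level circuit: the class name `BCSR` and its method names are invented for the example, carries are resolved instantly rather than rippling over time, and the cleared stage is reset in the same step in which the pointer advances.

```python
class BCSR:
    """Toy model of a ring counter/shift register with a rotating pointer."""

    def __init__(self, n_stages=8):
        self.bits = [0] * n_stages   # one toggle flip flop per stage
        self.pointer = 0             # index of the current "lowest" stage

    def _order(self):
        # Physical stage indices listed from lowest to highest significance.
        n = len(self.bits)
        return [(self.pointer + k) % n for k in range(n)]

    def pointer_lines(self):
        # One-hot pointer outputs: exactly one line is high at any time.
        return [int(i == self.pointer) for i in range(len(self.bits))]

    def value(self):
        return sum(self.bits[i] << k for k, i in enumerate(self._order()))

    def increment(self, bit=1):
        # Toggle the lowest stage; carry onward while a stage falls 1 -> 0.
        if not bit:
            return
        for i in self._order():
            self.bits[i] ^= 1
            if self.bits[i] == 1:    # this stage produced no carry
                break

    def downshift(self):
        # Read the lowest stage, clear it, and advance the pointer.
        # No stored bit moves; only the meaning of "lowest" changes.
        out = self.bits[self.pointer]
        self.bits[self.pointer] = 0  # old lowest stage becomes the new highest
        self.pointer = (self.pointer + 1) % len(self.bits)
        return out


counter = BCSR()
for _ in range(5):
    counter.increment()
lsb = counter.downshift()            # shifts out the low bit of 5
```

After the downshift the model holds the value 2 with the pointer on the second physical stage, which mirrors the point made above: the halving is achieved by renaming stages rather than by moving data.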
Buffer storage/controller 112 in FIG. 6D stores the
bit outshifted from the “lowest stage” of the BC/SR
subunit 54 to be written into LMU 50, logically or arith-
metically combined with the present state of the corre-
sponding P-register 76, or stored into other P-registers
76 along the data bus 52. Storage/controller 112 also
generates the necessary control signals to all BC/SR 
stages 104 as well as to the rotating pointer 124. 
Referring to FIG. 6D in more detail, the BC/SR 
controller portion 125 of storage/controller 112 com- 
prises an array of gates that receive command signals 
from ACU 24 representing downshift, clear and incre- 
ment, and generate corresponding control signals to 
components of the BC/SR subunit 54. The increment 
command at input port 126 is stored in a D flip flop 128. 
The clock signal from a master clock (not shown) at 
input port 130 strobes flip flop 128 following inversion 
in inverter 134 so that the increment command bit is 
stored in flip flop 128 at the trailing edge of the clock 
pulse. AND gate 136 outputs an increment control 
signal during the first portion of the next cycle period 
defined by the master clock. 
Three control signals are generated by BC/SR con- 
troller 125 for downshift operation. The first signal 
(downshift control) is obtained from the output of AND 
gate 138 which transfers the downshift command ap- 
plied to output port 140 in synchronism to the clock 
signal at clock input port 130. The second signal (down- 
shift completion control) is obtained from the output of 
AND gate 142 responsive to a coincidence of a clock 
signal generated by mask subunit 58 (FIG. 2) and the 
downshift command signal applied at port 141 and 
transferred through flip flop 144 in synchronism with 
the master clock signal at port 130. The third control 
signal (delayed downshift control) is generated by 
AND gate 146 in response to the downshift command 
supplied by ACU 24 to port 141 and to an inverted 
clock signal from the mask subunit 58 applied to input 
port 148. The output of gate 146, identified by 150, is
supplied to the counter 128 in FIG. 6C. A clear control 
signal generated by AND gate 147 is synchronized to a 
clear command supplied by ACU 24 at input 149 and 
the inverted clock signal at line 148. 
Prior to transfer of data from bus 52 to the lowest
stage of BC/SR 54 for an increment operation, the data
passes through gates 151 as well as inverters 152. When
a data bus control signal on input 154 is low, the output
156 of gates 151 increments BC/SR 54 via increment
bus 158 if the instantaneous state of data bus 52 is high.
If the data bus control signal is high, on the other hand,
the complemented data bus state is applied to the incre-
ment bus on output 156 and increments the BC/SR 54 if
the instantaneous state of data bus 52 is low. Thus the
BC/SR 54 is incremented by either the true or comple-
mented logical state of data bus 52 depending respec-
tively on the low or high state of the data bus control
signal applied at 154.
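The true/complement selection just described amounts to an exclusive OR of the data bus state with the data bus control signal. The short sketch below shows that relationship; the function name `increment_enable` and the sample bit stream are illustrative assumptions.

```python
def increment_enable(data_bus: int, bus_control: int) -> int:
    """Bit placed on the increment bus: the true bus state when the control
    signal is low, the complemented bus state when it is high."""
    return data_bus ^ bus_control


stream = [1, 0, 0, 1, 0]
# Control low: the counter advances once for every 1 on the bus.
ones_counted = sum(increment_enable(b, 0) for b in stream)
# Control high: the counter advances once for every 0 on the bus.
zeros_counted = sum(increment_enable(b, 1) for b in stream)
```

With the control signal held high, the same counter hardware counts zeros instead of ones.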
The downshift buffer storage portion 160 of buffer 
storage/controller 112 comprises a D-type flip flop 162 
having an output 164 connected to bidirectional data 
bus 52 through a tri-state output gate 166. Information 
shifted out from the lowest stage of BC/SR subunit 54 
is supplied to flip flop 162 along downshift bus 168
(FIG. 6A). The shifted out information is stored into 
flip flop 162 at the trailing edge of the master clock 
signal by being synchronized to the inverting clock 
signal from the mask subunit 58 through the gate 146. 
Referring again to FIG. 6B, a single stage 104 of
BC/SR 54 is shown in detail. The stage 104 comprises a
toggle flip flop 168 having a clear input 170 and a toggle
input 171. The output Q 172 of the flip flop 168 is sup-
plied to the downshift bus 168 through a tri-state gate
174 that is controlled by an AND gate 176 responsive to
the downshift control signal supplied by line 140 of
controller 112 (FIG. 6D) and the pointer control signal
generated by pointer 124 (FIG. 6C). Each of the stages
104 of BC/SR subunit 54 receives a toggle input from
the next lower stage of subunit 54 through AND gate 178 and
OR gate 180. The lowest stage of subunit 54, which is
the one receiving a high signal from pointer 124, re-
ceives a corresponding low signal through inverter 182.
This effectively disconnects the lowest stage of the
BC/SR subunit 54 from its neighboring lower stage in
the ring structure shown in FIG. 6A. Toggle flip flop
168 is subsequently reset when the downshift comple-
tion control signal is high after the high pointer control
signal has been transferred from the original lowest
stage to the next higher stage. This is effected through
OR gate 184 receiving the clear control signal on line
186 for clearing the entire BC/SR subunit 54, and the
downshift completion control signal on line 188 at the
end of a BC/SR 54 downshift operation.
The increment control signal generated by gate 136 in 
FIG. 6D is essentially a clock signal from the mask 
subunit 58 phase shifted by one full cycle period. Thus, 
the lowest stage of BC/SR subunit 54 (identified by a 
high signal at its pointer control input terminal 190, as 
shown in FIG. 6B) will be toggled if the selected toggle
input is also high. This happens at the rising edge of the 
clock signal during the subsequent clock period. 
During a downshift command, the tri-state output
gate 174 (FIG. 6B) of the lowest stage is closed, supply-
ing the logical state of the stage onto the data bus 52.
Then, the rising edge of the clock signal from AND
gate 146 (FIG. 6D) of the BC/SR controller line at 150,
that is, the delayed downshift control signal, which
coincides with the trailing edge of the clock signal from
the mask subunit 58, strobes the downshift bus 168 to
transfer its instantaneous state into flip flop 162 in FIG.
6D. The falling edge of the delayed downshift control
signal increments the counter 128 of the rotating pointer
124 (FIG. 6C) and thus moves the pointer location to
the next higher stage. Furthermore, during the high 
period of the output of AND gate 146 (FIG. 6D), the 
tri-state output gate 166 of the downshift buffer storage 
160 is closed so that the content of the lowest stage 
already stored in the downshift buffer storage can be 
read out and transferred to the data bus 52. During the 
high period of the downshift completion control signal 
at terminal 192 (FIG. 6B), because the original lowest 
stage has already become the new highest stage of the 
BC/SR subunit 54, this stage must be reset through 
AND gate 187 and OR gate 184. 
Finally, it is to be noted that the clear control signal 
applied to line 186 of gate 184 in FIG. 6B resets all 
BC/SR stages 104. The clear operation can be per- 
formed either in a masked or unmasked mode. 
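One way a counter with increment and downshift operations can support multi-bit arithmetic is bit-serial addition: for each bit position, the counter is incremented by the corresponding bit of each operand, and a downshift then delivers the sum bit while the remaining contents act as the carry. The sketch below illustrates that general technique; it is not a procedure quoted from the patent, and the helper names `bit_serial_add`, `to_bits` and `from_bits` are assumptions.

```python
def bit_serial_add(a_bits, b_bits):
    """Add two unsigned numbers given as LSB-first bit lists of equal length."""
    counter = 0                # stands in for the counter contents (running carry)
    sum_bits = []
    for a, b in zip(a_bits, b_bits):
        counter += a           # increment by the first operand's bit
        counter += b           # increment by the second operand's bit
        sum_bits.append(counter & 1)   # downshift: the lowest stage is shifted out
        counter >>= 1                  # remaining stages now sit one place lower
    sum_bits.append(counter & 1)       # one more downshift flushes the final carry
    return sum_bits


def to_bits(x, n):
    """LSB-first list of the low n bits of x."""
    return [(x >> k) & 1 for k in range(n)]


def from_bits(bits):
    """Inverse of to_bits."""
    return sum(b << k for k, b in enumerate(bits))


total = from_bits(bit_serial_add(to_bits(13, 4), to_bits(11, 4)))
```

Because every processing element would run the same sequence on its own data stream, the cost of such an addition grows with the word length rather than with the number of array elements.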
Referring to FIG. 7, mask subunit 58, which controls
the operation of BC/SR subunit 54 as well as the logic- 
slider subunit 56 in response to a mode control signal, is 
shown in detail. The mask subunit 58 comprises a regis- 
ter (G-register) 200 that stores a mask bit which selec- 
tively inhibits or activates logic-slider subunit 56 and 
BC/SR subunit 54 if the ACU 24 calls for a masked 
mode of operation that is communicated to the mask 
subunit 58 over the bidirectional data bus 52. The high 
or low signal indicating whether or not a masked mode 
is called for is clocked into G-register 200 by a delayed 
write in mask command generated by ACU 24 onto flip 
flop clock line 202. The bit stored in G-register 200 is 
thereafter transferred, through logic circuit 205, to 
gates 206 and 208 to be transferred in the inverted and
noninverted forms, respectively, to the BC/SR subunit
54 and through gate 210 to the logic-slider subunit 56 in
response to the master clock signal supplied to line 212
and the delayed P-register write in command applied to
line 214. The logic circuit 205 synchronizes enablement
of the gates 206, 208 and 210 with respect to generation
of the mode command on line 204. The command inputs 
shown in FIG. 7 are generated by control signal distrib- 
utor 250 illustrated in FIG. 11. 
Referring to FIG. 8, I/O unit 48 comprises a single 
bit register 216 (S-register) that receives data from bidi- 
rectional data bus 52 through gate 218 and gate 220. 
The bit stored in register 216 is read out onto the data
bus 52 through tri-state output gate 122 under control of
line 223. Control signals applied to lines 224 and 226 to
the input of gates 222 and 218, respectively, determine
whether the subunit 48 is executing a slide operation or
simply storing information directly from data bus 52.
Storage of data into S-register 216 is synchronized to
the master clock on line 228. Input 230 to gate 222 is
supplied from the neighboring processing element to
the right, whereas the output 232 of register 216 is sup-
plied to the neighboring processing element to the left.
The input control signals applied to lines 223, 224, 226
and 228 are all generated by control signal distributor
250 (FIG. 11), described below.
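The right-to-left chaining of S-registers can be sketched for a single row of processing elements. In the illustrative model below, each slide step moves every stored bit one element to the left. The function names, and the assumption that the rightmost element is fed from an external source while the leftmost bit leaves the array, are choices made for the example rather than details stated in the text.

```python
def shift_in_row(s_row, incoming_bit):
    """One slide step: every S-register takes the bit of its right-hand
    neighbor; the rightmost takes `incoming_bit`; the leftmost bit leaves."""
    outgoing = s_row[0]
    return s_row[1:] + [incoming_bit], outgoing


def load_row(width, bits):
    """Feed `bits` into an initially cleared row, one slide step per bit."""
    row = [0] * width
    shifted_out = []
    for b in bits:
        row, out = shift_in_row(row, b)
        shifted_out.append(out)
    return row, shifted_out


row, shifted_out = load_row(4, [1, 0, 1, 1])
```

In this model a row of width N is completely refilled after N slide steps, with the first bit fed in ending up in the leftmost element.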
Referring to FIG. 9, LMU 50 comprises a random 
access memory (RAM) 240 that is addressed by ACU 
24 on address line 242. The memory 240 is in a write
mode when a high write enable signal is applied to line
244. If this signal is low, LMU 50 is in the read mode
and the contents of the RAM at an address location are
transferred from the memory to the bidirectional bus 52
by a control signal applied to line 246. The signals ap-
plied to lines 244 and 246 of the LMU 50 are generated
by distributor 250 shown in FIG. 11. It is noted that the 
RAM 240 of LMU 50 as well as the I/O unit 48 de- 
scribed in FIG. 8 are not controlled by the mask subunit 
58 shown in FIG. 7. 
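The read and write behavior of the local memory can be summarized in a few lines. The sketch below is a behavioral model only: the class name `LMU`, the method `access`, the memory depth of 16 words, and the use of `None` for an undriven bus are assumptions introduced for the example.

```python
class LMU:
    """Toy model of one local memory unit: a bit-wide RAM on a shared bus."""

    def __init__(self, n_words=16):      # depth is an arbitrary assumption
        self.ram = [0] * n_words

    def access(self, address, write_enable, read_enable, bus=None):
        """Perform one access and return the resulting bus state.

        write_enable high: store the bit currently on the bus.
        write_enable low and read_enable high: drive the stored bit onto the bus.
        Otherwise the memory leaves the bus as it found it.
        """
        if write_enable:
            if bus is None:
                raise ValueError("nothing is driving the bus during a write")
            self.ram[address] = bus
            return bus
        if read_enable:
            return self.ram[address]
        return bus


memory = LMU()
memory.access(3, write_enable=1, read_enable=0, bus=1)   # store a 1 at address 3
readback = memory.access(3, write_enable=0, read_enable=1)
```

The model takes no mask input, consistent with the statement that the RAM is not controlled by the mask subunit.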
Referring to FIG. 10, timing and sequencing of the 
basic operations of the processing elements 44 are 
shown. Two successive cycles in the master clock sig-
nal having a cycle period M are illustrated. Each cycle
has a high level that extends for a duration of T1, and a
low level that extends for a duration of T2. The repeti- 
tion rate of the clock signal is determined by the mini- 
mum access time and maximum clock rate of the RAM 
used in LMU 50, and in practice, is ten megahertz. 
An instruction to be executed during each clock per- 
iod stored in an array instruction register (not shown) 
within ACU 24 becomes available to the processing 
elements 44 at the rising edge of the clock cycle, as 
shown. During the high level period T1, the lowest
stage of BC/SR subunit 54 is read out and stored into 
downshift buffer storage 160 (FIG. 6D) at the trailing 
edge of the high level period T1 (assuming that the 
instruction calls for a downshift operation). During the 
low level period T2, data stored in the P- and S-regis- 
ters, 76 (FIG. 5) and 216 (FIG. 8), respectively, are read 
out if a logic or arithmetic operation or a data routing 
(sliding) operation is called for. If data must be read 
from LMU 50, the RAM in the LMU 50 is accessed 
during the T2 period. If a BC/SR downshift operation is 
called for, the data already stored in the buffer storage
160 (FIG. 6D) during period T1 will now be read out
onto the bidirectional data bus 52.
During the rising edge of a subsequent clock cycle 
(M + 1), any writing of data into the P, S or G-registers
called for by instruction M will be executed. If the 
present array instruction M calls for incrementing the 
BC/SR subunit 54, the subunit is incremented at the 
lowest stage of the subunit by the instantaneous state of 
the data bus or by the logical complement of the state of 
the data bus as discussed above in connection with FIG. 
6A. 
Thus, as illustrated in FIG. 10, the execution period 
of an array instruction is slightly longer than one cycle 
(T1+T2). In fact, the last portion of the execution per-
iod of instruction M overlaps with the beginning por-
tion of the execution period of instruction M+1. The
actual execution rate of the computer, however, is mea- 
sured by the cycle rate rather than the execution period. 
The above overlap is intentionally built into the design 
in order to achieve the highest throughput. 
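The consequence of measuring throughput by cycle rate can be checked with simple arithmetic. The sketch below uses the ten megahertz clock stated above and the 128 x 128 array size given in the abstract; the overlap duration passed to `total_time_ns` is a purely hypothetical figure, since the text does not give one.

```python
CLOCK_HZ = 10_000_000      # ten megahertz master clock
ARRAY_SIDE = 128           # 128 x 128 array of processing elements

cycle_ns = 1e9 / CLOCK_HZ                     # one cycle, T1 + T2
elements = ARRAY_SIDE ** 2
peak_bit_ops_per_s = elements * CLOCK_HZ      # one single-bit operation per element per cycle


def total_time_ns(n_instructions, overlap_ns):
    """Time for a sequence in which each instruction starts one cycle after
    the previous one and runs `overlap_ns` into the following cycle."""
    return n_instructions * cycle_ns + overlap_ns


# Hypothetical 20 ns overlap: the tail is paid once, not once per instruction.
run_ns = total_time_ns(1000, 20)
```

Because successive execution periods overlap, the extra tail appears only once at the end of a sequence, so the sustained rate equals the clock rate.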
The control signals for operating the
I/O unit 48, the logic-slider subunit 56, the mask subunit
58 and the LMU 50 are generated by a processing ele-
ment command and control signals distributor 250,
shown in FIG. 11.
The timing relationships among the various input and
output signals in the distributor 250 are illustrated in
FIG. 12. The distributor 250 comprises five D-type flip
flops 252, 254, 256, 258 and 260 for delaying by a period
T1 the incoming command signals generated by the
instruction register of ACU 24. These flip flops are
strobed by the inverted master clock signal on line 268
so that data are stored into said flip flops at the trailing
edges of the clock pulses (the leading and trailing edges
of the clock pulses are separated by the time duration
T1).
AND gate 262 combines the output of flip flop 252
and the master clock signal on line 267 to form a pulse
of duration T1 starting at the rising edge of the subse-
quent cycle, as shown in FIG. 12a. The output of gate
262 is supplied to the mask subunit 58 at line 202 shown
also in FIG. 7. The delayed outputs of flip flops 254 and
256 are supplied, respectively, to lines 204 and 214 of
the mask subunit 58. Timing of the signals generated by
registers 254 and 256 is shown, respectively, in FIGS.
12b and 12c.
The signals to lines 244 and 246 of LMU 50 (FIG. 9) 
are supplied, respectively, by gates 264 and 266 in FIG. 
11. The outputs of gates 264 and 266 are responsive, 
respectively, to write and read commands generated by 
the array instruction register within ACU 24 synchro- 
nized to the master clock inverted line 268. The timing 
of the write and read signals is shown in FIGS. 12d and
12e, respectively.
The control signals supplied to lines 223, 224, 226 and
228 of I/O unit 48 (FIG. 8) are generated, respectively,
by gate 270, flip flop 260, gate 272 and flip flop 258. The
outputs of gates 270 and 272 are synchronized to clock
line 267, whereas the outputs of flip flops 258 and 260
are delayed by the period T1. Timing of the signals
generated by the flip flop 260 and gate 270 is shown, 
respectively, in FIGS. 12f and 12h. Timing of signals
generated by the flip flop 258 or 260 and by gate 272 is 
shown in FIG. 12g. 
The gating signal supplied to line 100 of logic-slider 
56 (FIG. 5) is generated by gate 251 synchronized to 
inverted clock line 268 as shown in FIG. 11. 
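The delay-and-gate scheme used by the distributor (a D flip flop strobed at the trailing edge of the clock pulse, followed by an AND with the master clock, as described for flip flop 252 and gate 262) can be traced with a two-phase model of each cycle. The sketch below is an idealized illustration: gate delays are ignored, each cycle is reduced to a high phase "T1" and a low phase "T2", and the function name `delayed_pulse` is an assumption.

```python
def delayed_pulse(commands):
    """commands[m] is the command bit issued for cycle m.

    Returns a list of (cycle, phase, gate_output) tuples, where phase is
    'T1' (clock high) or 'T2' (clock low) and gate_output is the AND of
    the flip flop output with the clock.
    """
    q = 0                                   # D flip flop output
    trace = []
    for m, cmd in enumerate(commands):
        clock = 1                           # phase T1: clock high
        trace.append((m, "T1", q & clock))  # flip flop still holds the earlier command
        q = cmd                             # trailing edge: latch the current command
        clock = 0                           # phase T2: clock low
        trace.append((m, "T2", q & clock))
    return trace


trace = delayed_pulse([1, 0, 0])
```

A command issued in cycle 0 yields a gate output that is high only during the T1 phase of cycle 1, that is, a pulse of duration T1 starting at the rising edge of the subsequent cycle.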
Thus, each processing unit 44 under the control of a 
common set of instructions stored in ACU 24 can be 
programmed to provide any predetermined logical or 
arithmetic manipulation on a bit stored in each process-
ing element 44 in ARU 22. In general, every stored bit
in the ARU 22 is operated on identically; however,
certain predetermined processing elements may be in-
hibited by their mask subunits 58, if so desired, by program-
ming the system in mask mode.
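The combination of a single instruction stream with per-element masking can be sketched as follows. The model is an illustration under assumptions: the function name `broadcast`, the convention that a mask bit of 1 enables an element, and the representation of bit planes as nested lists are all choices made for the example.

```python
def broadcast(op, p_plane, bus_plane, mask_plane=None):
    """Apply one instruction `op` to every processing element at once.

    p_plane holds each element's stored bit and bus_plane the bit on each
    element's data bus. In mask mode (mask_plane given), elements whose
    mask bit is 0 keep their old stored bit. The mask polarity is assumed.
    """
    out = []
    for r, row in enumerate(p_plane):
        new_row = []
        for c, p in enumerate(row):
            enabled = mask_plane is None or mask_plane[r][c] == 1
            new_row.append(op(p, bus_plane[r][c]) if enabled else p)
        out.append(new_row)
    return out


def xor(a, b):
    return a ^ b


p = [[1, 1], [0, 0]]
bus = [[1, 0], [1, 0]]
mask = [[1, 0], [0, 1]]
unmasked = broadcast(xor, p, bus)
masked = broadcast(xor, p, bus, mask)
```

In the unmasked call every element performs the exclusive OR; in the masked call only the two enabled elements change state.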
Because each processing element 44 contains all of 
the logical and arithmetical components necessary to 
perform a wide variety of data manipulations, the
system 20 is highly versatile, and can be adapted to 
perform complex algorithmic operations, such as cross 
correlation for image identification, image rotation, 
classification, distortion correction and other forms of 
image analysis. 
In this disclosure, there is shown and described only 
the preferred embodiments of the invention, but, as 
aforementioned, it is to be understood that the invention 
is capable of use in various other combinations and 
environments and is capable of changes or modifica- 
tions within the scope of the inventive concept as ex- 
pressed herein. 
What is claimed is: 
1. An apparatus for processing multidimensional, 
digital serial-by-bit data characterized by an ordered 
array of parallel data streams, comprising an ordered 
array of interconnected parallel processing elements 
corresponding to all or part of the data streams, and a 
control unit connected to said processing elements for 
causing said processing elements to process the data 
streams in response to a single set of instructions, each 
of said processing elements comprising a subunit A 
including means for arithmetic, shifting and memory 
operations, a subunit B including means for storing data, 
performing logical operations and sliding the stored 
data to a similar subunit in a neighboring processing 
element, a subunit C including means for storing, input- 
ting and outputting data, a subunit D including addi- 
tional memory means, and a bidirectional bus, all of said 
subunits being connected to said bidirectional bus for
providing communication between said subunits.
2. The apparatus of claim 1, wherein subunit A in-
cludes a counter/shift register.
3. The apparatus of claim 2, wherein said counter/shift
register includes means for storing bits, means
responsive to a first command signal from said control
unit for shifting said stored bits, and means responsive
to a second command signal from said control unit for
digitally adding said stored bits to an incoming bit.
4. The apparatus of claim 2, wherein said counter/shift
register comprises a plurality of registers arranged
in a closed ring configuration, pointer means for supply-
ing a pointer signal to said register ring for defining the
lowest register in said ring, and counter means for suc-
cessively indexing said pointer means, the lowest regis-
ter, defined by said pointer means, outputting its con-
tent to said common bus.
5. The apparatus of claim 1, wherein subunit D in-
cludes a random access memory.
6. The apparatus of claim 1, wherein each of said
processing elements further includes a subunit E includ-
ing means for selectively inhibiting the operability of
subunits A and B, said subunit E also being connected to
said bus.
7. An apparatus for processing multidimensional,
digital serial-by-bit data in the form of an N x M array
of parallel data streams, comprising a first N x M array
of subunits A each including means for arithmetic, shift-
ing and memory operations, a corresponding, second
N x M array of subunits B including means for storing
data, performing logical operations and sliding stored
data to similar subunits in said array, a corresponding,
third N x M array of subunits C including means for
storing, inputting and outputting data, and a corre-
sponding, fourth N x M array of bidirectional buses,
said arrays being interconnected in an ordered fashion,
means for transferring data among said subunits and said
arrays including said bidirectional buses, and a control
unit connected to said arrays for controlling processing
of all of said data streams in said first, second and third
arrays in accordance with a single set of instructions.
8. An apparatus for processing multidimensional,
digital serial-by-bit data in the form of an N x M array
of parallel data streams, comprising an N x M array of
interconnected parallel processing elements corre-
sponding in position, respectively, to the parallel data
streams, and a control unit connected to said processing
elements responsive to a single set of instructions for
causing said array of processing elements to perform
identical and simultaneous operations on single bit slices
of the parallel data streams, each of said processing
elements comprising a subunit A including means for
arithmetic, shifting and memory operations, a single bit
subunit B for storing a bit and including means for per-
forming logical and sliding operations, a subunit D
having additional memory means, and a bidirectional
bus, each of said subunits being connected to said bidi-
rectional bus for providing communication between
said subunits.
9. The apparatus of claim 8, wherein the memory of
subunit D provides for random access.
10. The apparatus of claim 8, wherein said control
unit provides means for sliding the data content of a
subunit B to another subunit B of a neighboring process-
ing element.
11. The apparatus of claim 8, wherein each process-
ing element further includes a subunit E for inhibiting
the operations of said subunits A and B in response to a
mask mode command generated by said control unit.
12. The apparatus of claim 8, wherein said subunit A
includes a counter/shift-register including means for
storing bits, means responsive to a first command signal
from a control unit for shifting said stored bits, and
means responsive to a second command signal from said
control unit for digitally adding said stored bits to an
incoming bit.
13. The apparatus of claim 8, wherein said counter/shift-
register comprises a plurality of registers arranged
in a closed ring configuration, pointer means for supply-
ing a pointer to said register ring for defining the
lowest register in said ring, and counter means for suc-
cessively indexing said pointer means, the lowest regis-
ter, defined by said pointer means, outputting its con-
tent to said bidirectional bus.
14. An apparatus for processing multidimensional,
digital serial-by-bit data characterized by an ordered
array of parallel data streams, comprising an ordered
array of interconnected parallel processing elements
corresponding to all, or part, of the data streams and a
control unit connected to said processing elements for
causing said processing elements to process the data
streams in response to a single set of instructions, each
of said processing elements, in turn, comprising a sub-
unit including a binary counter/shift register, a subunit
including logic for data to and of a plurality of
adjacent processing elements, a masking subunit for
optionally inhibiting a given processing element from
responding to a signal from said control unit, a subunit
including storage and means for inputting or outputting
data from a given processing element, a subunit includ-
ing additional memory over that provided by the sub-
unit including the binary counter/shift register, and a
bidirectional bus, all of said subunits being directly con-
nected to said directional bus, said interconnection al-
lowing for communication between said subunits.
* * * * *
