Multinode reconfigurable pipeline computer by Nosenchuck, Daniel M. & Littman, Michael G.
United States Patent [19] [11] Patent Number: 4,811,214
Nosenchuck et al. [45] Date of Patent: Mar. 7, 1989 
MULTINODE RECONFIGURABLE 
PIPELINE COMPUTER 
Inventors: Daniel M. Nosenchuck, Mercerville, 
N.J.; Michael G. Littman, 
Philadelphia, Pa. 
Assignee: Princeton University, Princeton, N.J. 
Appl. No.: 931,549 
Filed: Nov. 14, 1986
Int. Cl.4 ......................... G06F 9/00; G06F 15/16
U.S. Cl. .................................................... 364/200
Field of Search ... 364/736,748,749,200 MS File, 
364/900 MS File 
References Cited
U.S. PATENT DOCUMENTS
3,787,673 1/1974 Watson et al. ...................... 364/736
3,875,391 4/1975 Shapiro et al. ..................... 364/736
3,978,452 8/1976 Barton et al. ....................... 364/200
3,990,732 11/1976 Reicherl .................................. 289/2
4,051,551 9/1977 Lawrie et al. ....................... 364/200
4,101,950 7/1978 Stokes et al. ........................ 364/200
4,161,036 7/1979 Morris et al. ....................... 364/900
4,174,514 11/1979 Sternberg .............................. 382/49
4,225,920 9/1980 Stokes et al. ......................... 364/200
4,228,497 10/1980 Gupta et al. ........................ 364/200
4,244,019 1/1981 Anderson et al. .................. 364/200
4,247,892 1/1981 Lawrence ........................... 364/200
4,270,181 5/1981 Tanakura et al. ................... 364/736
4,307,447 12/1981 Provanzano et al. ............... 364/200
4,363,094 12/1982 Kaul et al. .......................... 364/200
4,438,494 5/1984 Budde et al. ........................ 364/200
4,442,498 4/1984 Rosen .................................. 364/745
4,454,489 6/1984 Donazzan et al. .................. 333/227
4,454,578 6/1984 Matsumoto et al. ................ 364/200
4,467,409 8/1984 Potash et al. ....................... 364/200
4,482,953 11/1984 Burke et al. ......................... 364/200
4,491,020 1/1985 Chubachi .............................. 73/606
4,498,134 2/1985 Hansen et al. ...................... 364/200
4,507,728 3/1985 Sakamoto ............................ 364/200
4,589,067 5/1986 Porter et al. ........................ 364/200
4,594,655 1/1986 Hao et al. ............................ 364/200
4,612,628 9/1986 Beauchamp et al. ............... 364/748
4,621,339 11/1986 Wagner et al. ..................... 364/900
4,761,755 8/1988 Ardini, Jr. et al. ................. 364/749
OTHER PUBLICATIONS
Schneck et al., "Parallel Processor Programs in the
Federal Government", IEEE Computer, Jun., 1985, pp. 43-56.
Kogge, Peter, "The Microprogramming of Pipelined
Processors", The 4th Annual Symposium on Computer
Architecture Conference Proceedings, ACM, Mar. 1977, pp. 63-69.
Nosenchuck et al., "Two-Dimensional Nonsteady Viscous
Flow Simulation on the Navier-Stokes Computer
MiniNode", Journal of Scientific Computing, vol. 1, No. 1,
1986, pp. 53-71.
Primary Examiner-Gareth D. Shaw
Assistant Examiner-Jonathan C. Fairbanks
Attorney, Agent, or Firm-Richard C. Woodbridge
[57] ABSTRACT
A multinode parallel-processing computer is made up of
a plurality of interconnected, large capacity nodes each
including a reconfigurable pipeline of functional units
such as Integer Arithmetic Logic Processors, Floating
Point Arithmetic Processors, Special Purpose Processors,
etc. The reconfigurable pipeline of each node is
connected to a multiplane memory by a Memory-ALU
switch NETwork (MASNET). The reconfigurable
pipeline includes three (3) basic substructures formed
from functional units which have been found to be
sufficient to perform the bulk of all calculations. The
MASNET controls the flow of signals from the memory
planes to the reconfigurable pipeline and vice versa.
The nodes are connectable together by an internode data
router (hyperspace router) so as to form a hypercube
configuration. The capability of the nodes to conditionally
configure the pipeline at each tick of the clock,
without requiring a pipeline flush, permits many powerful
algorithms to be implemented directly.
24 Claims, 9 Drawing Sheets 
https://ntrs.nasa.gov/search.jsp?R=20080012372 2019-08-30T03:56:43+00:00Z
U.S. Patent    Mar. 7, 1989    Sheets 1 to 9 of 9    4,811,214
[Drawing sheets not reproduced. Labels recoverable from the extracted text:]
Sheet 2 of 9, FIG. 2: LOCAL NODE BUS FOR DATA & ADDRESS; HIGH-SPEED BUS; MEMORY-ALU SWITCH (MASNET), 16 x 16 NONBLOCKING W/INTERNAL MEMORY; k DATAPATHS (k ≤ 8); PIPELINE, 480 Mflop.
Sheet 4 of 9, FIG. 4: OPERANDS (INPUTS); RESULTS (OUTPUTS).
Sheet 7 of 9: FROM MEMORY 28 AND ALU 24.
Sheet 9 of 9, FIG. 8.
MULTINODE RECONFIGURABLE PIPELINE COMPUTER
GOVERNMENT RIGHTS
This invention was made with Government support
under contract NAG-1-494 awarded by NASA. The
Government has certain rights in this invention.
BACKGROUND OF THE INVENTION
1. Field of the Invention
The invention relates to a computer formed of many
nodes in which each of the nodes includes a reconfigurable,
many-function ALU pipeline connected to multiple,
independent memory planes through a multi-function
memory-ALU network switch (MASNET). The multiple
nodes are connected in a hypercube topology.
2. Description of Related Art
The computer of the present invention is both a parallel
and a pipelined machine. The prior art does disclose
in certain limited contexts the concept of parallelism
and pipelining. See, for example, U.S. Pat. No.
4,589,067. However, the internal architecture of the
present invention is unique in that it allows for most, if
not all, of the [...] being simultaneously active.
U.S. Pat. No. 4,589,067 is typical of the
prior art in that it describes a vector processor based
upon a dynamically reconfigurable ALU pipeline. This
processor is similar to a single functional unit of the
present invention's reconfigurable pipeline. In one sense
the pipeline of the present invention's node is thus a
pipeline of pipelines. Other structures that possibly
merit comparison with the present invention are the
Systolic Array by Kung, the MIT Data-Flow Concept
and the concept of other parallel architectures.
The Systolic Array concept by H. T. Kung of Carnegie
Mellon University involves data which is "pumped"
(i.e. flows) through the computer as "waves". Unlike
the present invention, the Systolic Array system is comprised
of homogeneous [...] where each
building block performs a given operation. In the Systolic
Array computer, as data flows through, the interconnection
between identical building blocks [...]
fixed during a computation. At best, the configuration
cannot be changed until all data is processed by the
Systolic Array. In the present invention, by contrast,
the interconnection between building blocks can be
changed at any time, even when data is passing through
the pipeline (i.e. dynamic reconfiguration of interconnects).
The present invention is also distinct from the
Systolic Array concept in that each building block (i.e.
functional unit) of the node pipeline of the present invention
can perform a different operation from its
neighbors (e.g. functional unit 1 - floating point multiply;
functional unit 2 - integer minus; functional unit 3 -
logical compare, etc.). In addition, during the course of
computation, each building block of the present invention
can assume different functionalities (i.e. reconfiguration
of functionality).
The MIT Data-Flow computer is comprised of a
network of hardware-invoked instructions that may be
connected in a pipeline arrangement. The instruction
processing is asynchronous to the "data-flow". Each
data word is appended with a field of token bits which
determines the routing of the data to the appropriate
data instruction units. Each instruction unit has a data
queue for each operand input. The instruction does not
"fire" (i.e. execute) until all operands are present. The
present invention includes the concept of data flowing
through a pipeline network of hardware functional units
that perform operations on data (e.g. act as instructions
that process data). However, by contrast, the present
invention does not function in an asynchronous mode.
Instead, data is fetched from memory and is routed by a
switch (MASNET) to pipelined instruction units
through the centralized control of a very high speed
microsequencing unit. This synchronous control sequence
is in sharp contrast to the asynchronous distributed
data routing invoked by the Data Flow architecture.
Moreover, the present invention, unlike the Data-Flow
Machine, has no token field (i.e. a data field that
guides the data to the appropriate functional unit) nor do
the functional units have queues (i.e. buffers that hold
operands, instructions, or results). The Data-Flow Machine
has functional units waiting for data. The present
invention has functional units that are continuously
active. The control of the pipeline of the present invention
is achieved by a central controller, referred to as a
microsequencer, whereas the Data-Flow Machine uses
distributed control. The present invention also has the
ability to reconfigure itself based upon internal flow of
data using the TAG field, a feature not found in the
Data-Flow machine. Furthermore, the Data-Flow computer
does not effectively perform series of like or dissimilar
computations on continuous streams of vector data (i.e.
a single functional operation on all data flowing
through the pipeline). In contrast the present invention
performs this operation quite naturally.
There are two other principal differences between
the parallel architecture of the present invention and
other parallel architectures. First, each node of the
present invention involves a unique memory/processor
design (structure). Other parallel architectures involve
existing stand-alone computer architectures [...]
for interconnection with neighboring nodes. Second,
other general multiple-processors/parallel computers
use a central processing unit to oversee and control
interprocessor communications so that local processing
is suspended during global communications. The nodes
of the present invention, by contrast, use an interprocessor
router and cache memory which allows for communications
without disturbing local processing of data.
The following U.S. Patents discuss programmable or
reconfigurable pipeline processors: 3,787,673; 3,875,391;
3,990,732; 3,978,452; 4,161,036; 4,225,920; 4,228,497;
4,307,447; 4,454,489; 4,467,409; and 4,482,953. A useful
discussion of the history of both programmable and
non-programmable pipeline [...] is found in columns
1 through 4 of U.S. Pat. No. 4,594,655. In addition,
another discussion of the early efforts [...]
pipeline [...] is found in the article
entitled PROGRAMMING OF PIPELINED PROCESSORS
by Peter M. Kogge from the March 1977
edition of [...] pages 63-69.
Lastly, the following U.S. Patents are cited for their
general discussion of pipelined processors: 4,051,551;
4,101,960; 4,174,514; 4,244,019; 4,270,181; 4,363,094;
4,438,494; 4,442,498; 4,454,578; 4,491,020; 4,498,134 and
4,507,728.
SUMMARY OF THE INVENTION 
Briefly described, the present invention uses a small 
number (e.g. 128) of powerful nodes operating concurrently.
The individual nodes need not be, but could be,
synchronized. By limiting the number of nodes, the 
total communications and related hardware and soft- 
ware that is required to solve any given problem is kept 
to a manageable level while, at the same time, using to
advantage the gain in speed and capacity that is inherent
with concurrency. In addition, the interprocessor
communications between nodes of the present invention 
that do occur, do not interrupt the local processing of 
data within the node. These features provide for a very 
efficient means of processing large amounts of data 
rapidly. Each node of the present invention is comparable
in speed and performance to Class VI supercomputers
(e.g. Cray 2, Cyber 205, etc.). Within a given
node the computer uses many (e.g. 30) functional units
(e.g. floating point arithmetic processors, integer
arithmetic/logic processors, special-purpose processors,
etc.) organized in a synchronous, dynamically-recon- 
figurable pipeline such that most, if not all, of the func- 
tional units are active during each clock cycle of a given 
node. This architectural design serves to minimize the 
storage of intermediate results in memory and assures 
that the sustained speed of a typical calculation is close to
the peak speed of the machine. This, for example, is not 
the case with existing Class VI supercomputers where 
the actual sustained speed for a given computation is 
much less than the peak speed of the machine. In addi- 
tion, the invention further provides for flexible and 
general interconnection between the multiple planes of 
memory, the dynamically reconfigurable pipeline, and 
the interprocessor data routers. 
Each node of the present invention includes a reconfigurable
arithmetic/logic unit (ALU), a multiplane
memory and a memory-ALU network (MASNET) 
switch for routing data between the memory planes and 
the reconfigurable ALU. Each node also includes a 
microsequencer and a microcontroller for directing the 
timing and nature of the computations within each 
node. Communication between nodes is controlled by a 
plurality of hyperspace routers. A front end computer 
associated with significant off-line mass storage pro- 
vides the input instructions to the multi-node computer. 
The preferred connection topology of the nodes is that
of a boolean hypercube. 
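By way of illustration only, the boolean hypercube topology named above can be sketched in a few lines of Python. The sketch assumes the conventional numbering in which each node carries a d-bit binary address and is linked to the d nodes whose addresses differ from its own in exactly one bit; the function names `hypercube_neighbors` and `hop_count` are hypothetical and do not appear in the specification.

```python
from typing import List


def hypercube_neighbors(node: int, dimension: int) -> List[int]:
    """Return the nodes directly linked to `node` in a hypercube of order `dimension`.

    Assumption (not stated in the patent): nodes are numbered 0 .. 2**dimension - 1
    and two nodes are linked when their binary addresses differ in one bit.
    """
    if not 0 <= node < 2 ** dimension:
        raise ValueError("node id out of range for this dimension")
    # Flipping one address bit at a time yields each of the `dimension` neighbours.
    return [node ^ (1 << bit) for bit in range(dimension)]


def hop_count(src: int, dst: int) -> int:
    """Minimum number of internode links between two nodes (Hamming distance)."""
    return bin(src ^ dst).count("1")


if __name__ == "__main__":
    # An 8-node (order 3) hypercube: every node has exactly three neighbours.
    for node in range(8):
        print(node, hypercube_neighbors(node, 3))
    # A 128-node (order 7) machine never needs more than 7 hops.
    print(max(hop_count(0, other) for other in range(128)))
```

Under this numbering a 128-node machine needs only seven links per node, which is one way to read the stated aim of limiting the number of internode connections.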
The reconfigurable ALU pipeline within each node 
preferably comprises pipeline processing elements including
floating-point processors, integer/logic processors
and special-purpose elements. The processing elements
are wired into substructures that are known to
appear frequently in many user applications. Three 
hardwired substructures appear frequently within the 
reconfigurable ALU pipeline. One substructure comprises
a two-element unit, another comprises a three-element
unit and the last substructure comprises a one-element
unit. The three-element substructure is found typically
twice as frequently as the two-element substructure
and the two-element substructure is found typically
twice as frequently as the one-element substructure. The
judicious use of those substructures helps to reduce the 
complexity of the switching network employed to con- 
trol the configuration of the ALU pipeline. 
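As a non-limiting sketch, the three hardwired substructures can be modelled as fixed compositions of two-input operations. The input and output counts follow the detailed description of substructures 62, 64 and 66 (two, four and three inputs respectively, one output each); the Python function names and the choice of `operator` functions as stand-ins for the functional units are assumptions made purely for illustration.

```python
import operator
from typing import Callable

# A functional unit is modelled as any two-input, one-output operation.
Op = Callable[[float, float], float]


def one_element(op: Op, a: float, b: float) -> float:
    """Single functional unit: two inputs, one output."""
    return op(a, b)


def two_element(op1: Op, op2: Op, a: float, b: float, c: float) -> float:
    """Two units: the first combines a and b, the second combines that
    result with a third input c taken directly from outside."""
    return op2(op1(a, b), c)


def three_element(op1: Op, op2: Op, op3: Op,
                  a: float, b: float, c: float, d: float) -> float:
    """Three units: two units each take one pair of inputs and their
    outputs feed the third unit."""
    return op3(op1(a, b), op2(c, d))


if __name__ == "__main__":
    # (2 * 3) + (4 * 5) evaluated by one three-element substructure.
    print(three_element(operator.mul, operator.mul, operator.add, 2, 3, 4, 5))
    # (1 + 2) * 3 evaluated by one two-element substructure.
    print(two_element(operator.add, operator.mul, 1, 2, 3))
    # 5 - 2 evaluated by a single unit.
    print(one_element(operator.sub, 5, 2))
```

Because the wiring inside each substructure is fixed, only the choice of operation per unit and the connections between substructures remain to be switched, which is consistent with the stated reduction in switching-network complexity.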
The invention will be further understood by refer- 
ence to the following drawings. 
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 illustrates an embodiment of the multinode 
computer arranged in a two-dimension nearest-neigh- 
bor grid which is a subset of the boolean hypercube. 
FIG. 2 is a schematic diagram of an individual node 
illustrating the memory/MASNET/ALU circuit
interconnections.
FIG. 3 is a schematic diagram illustrating the layout 
of one memory plane within a single node such as illus- 
trated in FIG. 2. 
FIG. 4 illustrates two typical substructures formed 
from five arithmetic/logic units as might be found
within the reconfigurable ALU pipeline of each node. 
FIG. 5A illustrates a typical ALU pipeline organiza- 
tion and the switching network (FLONET) which al- 
lows for a change in configuration of the substructures. 
FIG. 5B illustrates a preferred embodiment of the 
interconnection of a FLONET to a grouping of the 
three common substructures in a reconfigurable ALU 
pipeline. 
FIG. 6 is a schematic diagram of a 32-register x n-bit
memory/ALU network switch (MASNET) and inter- 
node communications unit where the blocks represent 
six port register files. 
FIG. 7 is a schematic diagram of a 2 x 2 MASNET
which illustrates how the input data stream can source 
two output data streams with a relative shift of "p" 
elements. 
FIG. 8 is a schematic diagram of an 8-node hyper- 
cube showing the relationship of the hyperspace routers 
to the MASNET units of each node. 
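For illustration only, the behaviour attributed to FIG. 7 above (one input stream sourcing two output streams with a relative shift of "p" elements) can be imitated in software. The sketch below is an assumption-laden model, not the hardware: a bounded `deque` stands in for the local memory of a register file, and the name `shifted_streams` is hypothetical.

```python
from collections import deque
from typing import Iterable, Iterator, Optional, Tuple


def shifted_streams(source: Iterable[float],
                    p: int) -> Iterator[Tuple[float, Optional[float]]]:
    """Yield (current, delayed) pairs where `delayed` lags `current` by p elements.

    `delayed` is None until p elements have passed through the buffer.
    """
    if p < 0:
        raise ValueError("shift must be non-negative")
    # Holds the last p + 1 words; stands in for the register file's local memory.
    buffer = deque(maxlen=p + 1)
    for word in source:
        buffer.append(word)
        delayed = buffer[0] if len(buffer) == p + 1 else None
        yield word, delayed


if __name__ == "__main__":
    # With p = 2, the second stream trails the first by two elements.
    for current, delayed in shifted_streams([10, 20, 30, 40], 2):
        print(current, delayed)
```

A pair of streams of this kind is what a difference such as x[i] - x[i-p] needs on every clock, without a second fetch of the same data from memory.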
DETAILED DESCRIPTION OF THE 
INVENTION 
During the course of this description, like numbers 
will be used to identify like elements according to the 
different figures which illustrate the invention. 
The computer 10 according to the preferred embodi- 
ment of the invention illustrated in FIG. 1 includes a 
plurality of multiple memory/computational units re- 
ferred to as nodes 12. Computer 10 is of the parallel- 
processing variety capable of performing arithmetic 
and logical operations with high vector and scalar efficiency
and speed. Such a device is capable of solving a
wide range of computational problems. Each node 12 is 
connected via drop-line network 18 to a front end com- 
puter 16 that provides a host environment suitable for 
multi-user program development, multinode initializa- 
tion and operation, and off-line data manipulation. 
Front-end computer 16 is connected to an offline mass 
storage unit 20 by interconnection 22. Each node 12 is 
also connected to adjacent nodes by internode connec- 
tions 14. For purposes of clarity and illustration, only 25 
nodes 12 are illustrated with simple internode links 14 in 
FIG. 1. However, it will be appreciated that the nodes 
12 can be connected in a general hypercube configura- 
tion and that the invention may comprise fewer or more 
than 128 nodes as the application requires. Rather than 
interconnect a large number of relatively slow micro- 
processors, as is done with other prior art parallel com- 
puters, the present invention incorporates a relatively 
small number of interconnected, large-capacity, high- 
speed powerful nodes 12. According to the preferred 
embodiment of the present invention, the configuration 
typically consists of between 1 and 128 nodes 12. This
approach limits the number of physical and logical
interconnects 14 between nodes 12. The preferred
connection topology is that of a boolean hypercube. Each
of the nodes 12 of the computer 10 is comparable to a
class VI supercomputer in processing speed and capacity.
The details of a typical individual node 12 are illustrated
in FIG. 2. Each node 12, which is the building
block of the computer 10, is comprised of five (5) basic
elements, namely: (1) a reconfigurable ALU pipeline 24
having many (e.g. 9 or more) high-performance and
special-purpose elements 62, (2) a group 28 of independent
memory planes 30, (3) a non-blocking multiple-input
and multiple-output switching MASNET
(Memory/ALU Switch network) 26, (4) a microsequencer
40 and (5) a microcontroller 42. FIG. 2
illustrates such a node 12 which includes 8 memory
planes 30 connected to a reconfigurable pipeline 24 by
memory-ALU network switch (MASNET) 26. As used
in this description the terms "processing elements",
"functional unit", "programmable processors" and
"building blocks" refer to arithmetic/logic units 62
which comprise either floating point arithmetic processors,
integer/arithmetic/logic processors, special-purpose
processors or a combination of the foregoing.
Microsequencer 40 is connected via lines 46 to memory
28, MASNET 26 and reconfigurable ALU pipeline
24 respectively. Similarly, microcontroller 42 is connected
to the same elements via lines 44. Microsequencer
40 governs the clocking of data between
and within the various elements and serves to define
data pathways and the configuration of pipeline 24 for
each clock tick of the node 12. In a typical operation, a
new set of operands is presented to the pipeline 24 and
a new set of results is derived from the pipeline 24 on
every clock of the node 12. Microsequencer 40 is responsible
for the selection of the microcode that defines
the configuration of pipeline 24, MASNET 26 and
memory planes 30. In typical operation, the addresses
increase sequentially in each clock period from a specific
start address until a specified end address is
reached. The address ramp is repeated continually until
an end-of-computation interrupt flag is issued. The actual
memory address used by a given plane 30 of memory
28 may differ from the microsequencer 40 address
depending upon the addressing mode selected. (See
discussion concerning memory planes below).
Microcontroller 42, also referred to as a node manager,
is used to initialize and provide verification of the
various parts of the node 12. For a given computation,
after the initial set up, control is passed to the microsequencer
40 which takes over until the computation
is complete. In principle, microcontroller 42 does not
need to be active during the time that computations are
being performed although in a typical operation the
microcontroller 42 would be monitoring the progress of
the computation and preparing unused parts of the computer
for the next computation.
In addition to the five basic elements which constitute
a minimal node 12, each node 12 may be expanded to
include local mass storage units, graphic processors,
pre- and post-processors, auxiliary data routers, and the
like. Each node 12 is operable in a stand-alone mode
because the node manager 42 is a stand-alone microcomputer.
However, in the normal case the node 12 would
be programmed from the front-end computer 16.
The layout of a single memory plane 30 is schematically
illustrated in FIG. 3. Memory planes 30 are of high
capacity and are capable of sourcing (reading) or sinking
(writing) a data word in a clock of the machine 10.
Each memory plane 30 can be enabled for read-only,
write-only or read/write operations. The memory
planes 30 support three possible addressing modes,
namely: (1) direct, (2) translate and (3) computed. With
all three modes, the working address is prefetched by
prefetch address register 52 on the previous cycle of the
computer 10. In the direct mode, the address from the
microsequencer address bus 46 is used to select the
memory element of interest. In the translate mode, the
microsequencer address is used to look up the actual
address in a large memory table of addresses. This large
table of addresses is stored in a separate memory unit
referred to as the translate memory bank or table 50.
The translate table 50 can be used to generate an arbitrary
scan pattern through main memory bank 54. It can
also be used to protect certain designated memory elements
from ever being over-written. The computed
address mode allows the pipeline 24 to define the address
of the next sourced or sinked data word.
Reconfigurable pipeline 24 is formed of various processing
elements shown as units 62 in FIG. 4 and a
switch network shown as FLONET 70 in FIGS. 5A
and 5B (FLONET is an abbreviation for Functional and
Logical Organization NETwork). Three (3) permanently
hardwired substructures or units 62, 64 or 66 are
connected to FLONET. FLONET 70 reconfigures the
wiring of the pipelined substructures 62, 64 and 66 illustrated
collectively as 68 in FIG. 5A and 69 in FIG. 5B.
The specialized reconfigurable interconnection is
achieved by electronic switches so that new configurations
can be defined within a clock period of the node
12. An example of high-level data processing in a specific
situation is shown in FIG. 4. The pipeline processing
elements include floating-point arithmetic processors
(e.g. AMD 29325, Weitek 1032/1033), integer
arithmetic/logic units 62 (e.g. AMD 29332) and special-purpose
elements such as vector regeneration units
and convergence checkers. A useful discussion related
to the foregoing special-purpose elements can be found
in an article entitled "Two-Dimensional, Non Steady
Viscous Flow Simulation on the Navier Stokes Computer
MiniNode", J. Sci. Comput., Vol. 1, No. 1 (1986)
by D. M. Nosenchuck, M. G. Littman and W. Flannery.
Processing elements 62 are wired together in three (3)
distinct substructures 62, 64 and 66 that have been found
to appear frequently in many user application programs.
Two of the most commonly used substructures 64 and
66 are shown by the elements enclosed in dotted lines in
FIG. 4. Substructure 64 comprises three ALU units 62
having four inputs and one output. Two ALU units 62
accept the four inputs in two pairs of twos. The outputs
of the two ALU units 62 form the two inputs to the
third ALU unit 62. Each of the three ALU units 62 is
capable of performing floating point and integer addition,
subtraction, multiplication and division, logical
AND, OR, NOT, exclusive OR, mask, shift, and compare
functions with a logical register file used to store
constants. Substructure 66 comprises two arithmetic/logic
units 62 and is adapted to provide three inputs and
one output. One of the two arithmetic/logic units 62
accepts two inputs and produces one output that forms
one input to the second arithmetic/logic unit 62. The
other input to the second arithmetic/logic unit 62
comes directly from the outside. The single output of
substructure 66 comes from the second arithmetic/logic
unit 62. Accordingly, substructure 66 comprises a three
input and one output device. The third and last most
common substructure is an individual arithmetic/logic
unit 62 standing alone, i.e. two inputs and one output. is also a feature of MASNET 26. FIG. 7 illustrates more 
Substructures 62,64 and 66 are permanently hardwired explicitly how a 2 x 2  MASNET (i.e. a single register 
into those respective configurations, however, the re- file 72) can achieve both of these simple tasks. 
configuration among those units is controlled by FLO- MASNET 26 is used also for internode commumca- 
NET 70. A simplified FLONET 70 is schematically 5 tions in that it routes data words corresponding to the 
represented in FIG. SA. For simplicity, two three-ele- nodal boundaries to bordering nodes 12 through hyper- 
ment substructures 64, tWO two-element substructures space routem 80. This routing is achieved % the data 
66 and two one-element substructures 62 are illustrated. flows through the MASNET 26 without the introduc- 
This results in a twelve-functional Unit, high-level tion of any additional delays. Likewise, the hyperspace 
reconfigurable pipeline 24. 10 router 80 of a given node 12 can inject needed boundary 
FIG. SB illustrates an optimal layout of a FLONET- point values into the data stream as they are needed 
/ALLJ interCmm~t. According to the Preferred em- without the introduction of any delays. A more detailed 
bodiment of the invention 10, the optimal ratio between discusion of internode communications follows, 
the three-element substructures 64 and the two element The global topo~ogy of the muitinode computer 10 ,s 
substructures 66 is in the range of 1.5 to 2.0 to 1.0 15 that o f a  hypercube. The hypercube represents a com- 
(1.5-2.O:l). Likewise the optimal ratio between the two promise between the time required for arbitrary inter- 
element substructures 66 and the single-element sub- node comm~cations, the number of physical in- 
FIG. 5B illustrates the optimal scenario which includes modes support internode data communications, namely: 
eight three-element substructures 64, four two-element 20 (1) global and (2) explicit boundary-polnt 
substructures 66 and two single-element substructures definition, or BPD. Global addressing is simply ex- 
62. The number of three element substructures 64 could tended addressing, where an address specifies the 
structures 62 is approximately 2 to 1 (2:1)* Accordingly, terconnections between nodes 12. Two addressing 
vary between and according to the embodiment nodel'memory-plane/offset of the data. From a soft- 
illustrated in ware standpoint, the address is treated * a simple linear 
scribed are approximate and might slightly from 25 address whose range extends across all node  in the 
5B- The preferred ratios just de- 
application to application. However, it has been found 
mal results. 
computer Internode communications is handled by 
if default arbitration and communications-lock parame- 
that the foregoing ratios do provide to 'pti- 
According to the Preferred embodiment Of the inven- 
software and is entirely transparent to the programmer 
terS we chosen. BPD involves the explicit definition of 
and 66 in 30 boundary points, their source, and all destination ad- tion the grouping 69 Of substructure 62? FIG. 5B have the functional units, or building blocks, 
62 organized in the following manner: each of the three 
eight substructures 64 would be floating point proces- 
would have ach of their two functional units 62 in the 
whereas the remaining two substructures 66 would 
have integerflogic processors like the AMD 29332; 
lastly one of the remaining single functional units 62 40 'Odes' 
29325 the other remaining single functional unit 62 
would be an integer logic processor like the AMD 
dresses. Whenever BPD data is generated, it is immedi. 
ately routed to BDP caches s2 in the destination nodes 
may be intermixed. The main advantage of global ad- 
BPD has the of eliminating internode 
boundary-point data before they are requested by other 
function units 62 (i'e* Prorammable processors) in the 12 illustrated in FIG. 8. Local &dressing and BpD 
sors like the AMD 29325; two of substructures 66 [...] form of floating point
processors like the AMD 29325 [...] would be a floating point processor like the
AMD 29332. Alternatively, it is also possible to pair processors to form a hybrid
functional unit 62. For example, a floating point processor like the AMD 29325
could be paired, in a manner known to those of ordinary skill in the art, with an
integer logic processor like the AMD 29332 so that the functional unit 62 can
alternate between floating point and integer/logic. It is also possible to use a
single many-function processor (floating point/integer arithmetic/logic) like the
Weitek 3332 to achieve the same result.
The details of a MASNET 26 (Memory/ALU Switch NETwork) are shown with sixteen
inputs and sixteen outputs in FIG. 6. MASNET 26 is made up of register files 72
(e.g. Weitek 1066) that are cross connected in a Benes switching network
arrangement and pipelined so as to make the connection of any input to any output
non-blocking. The MASNET 26 illustrated in FIG. 6 is a sixteen-by-sixteen (16x16)
circuit. The fact that each register file 72 has local memory also means that by
using the MASNET 26 it is possible to reorder data as it flows through the
network. This feature can be used, for example, to create two data streams from a
common source in which one is delayed with respect to the other by several
elements. The formation of multiple data streams from a common source
dressing over BPD is software simplicity, although [...] communications overhead
by precommunicating [...]
Data are physically routed between nodes 12 using local switching networks
attached to each node 12. The local switching networks, previously referred to as
hyperspace routers 80, are [...]. Hyperspace routers 80 are non-blocking
permutation networks with a [...] to the Benes network.
For a multinode class computer of order d (i.e., $NN = 2^d$, NN = number of
nodes), the hyperspace router permits d+1 inputs, which include d neighboring
nodes 12 plus one additional input for the host node 12. The data are
self-routing in that the destination address, carried with the data, is used to
establish hyperspace router switch states. An eight node system is illustrated in
FIG. 8. In this example, d=3, and each hyperspace router 80 has a 4x4 network
with a delay of three minor clocks. For 3<d<8, where d is an integer, an 8x8
router 80 is required, with d=7 providing complete switch utilization. Since the
hyperspace router 80 must be configured for $\log_2(d+1)$ inputs, optimal
hardware performance is given by a computer array having the size of
$$NN = 2^{2^n - 1}, \quad n = 0, 1, 2, 3, \ldots$$
Configurations of 1, 2, 8, 128, . . . nodes fully utilize
the hyperspace routers 80. Multinode computer configurations
with non-integer $\log_2(d+1)$ are also supported,
except the hyperspace router 80 is scaled up to the next
integral dimension. The implications of this are not 
severe, in that aside from the penalty of additional 
switch hardware, a slightly greater amount of storage is 
required for the permutation tables. The node stores
these tables in a high-speed look-up table. The length of
the table is (d+1)!. When the computer grows beyond
128 nodes, the hyperspace router increases to a 16x16
switch. Since the look-up tables become prohibitively
large, the permutation routing is then accomplished by
bit-slice hardware which is somewhat slower than the 
look-up tables. These considerations have established 
128 nodes as the initial, preferred computer configura- 
tion. 
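The sizing argument above can be checked with a short sketch. It is illustrative only: the function names are hypothetical, the switch is assumed to come in power-of-two sizes (4x4, 8x8, 16x16, as in the text), the table length is taken to be (d+1)!, and the neighbor rule assumes the usual boolean hypercube addressing in which neighboring nodes differ in exactly one address bit.

```python
from math import factorial


def hypercube_neighbors(node: int, d: int) -> list[int]:
    """Addresses that differ from `node` in exactly one of its d bits
    (assumed boolean-hypercube addressing)."""
    return [node ^ (1 << bit) for bit in range(d)]


def router_ports(d: int) -> int:
    """d inputs from neighboring nodes plus one input for the host node."""
    return d + 1


def switch_size(d: int) -> int:
    """Smallest power-of-two switch dimension that fits d + 1 ports
    (assumption: routers are built as 4x4, 8x8, 16x16, ...)."""
    size = 1
    while size < router_ports(d):
        size *= 2
    return size


def table_length(d: int) -> int:
    """Permutation-table length, taken here as (d + 1)!."""
    return factorial(d + 1)


def fully_utilized_sizes(count: int) -> list[int]:
    """Node counts NN = 2**(2**n - 1) for n = 0 .. count - 1."""
    return [2 ** (2 ** n - 1) for n in range(count)]


if __name__ == "__main__":
    print(fully_utilized_sizes(4))       # node counts that fill the switch
    for d in (3, 5, 7, 8):
        print(d, router_ports(d), switch_size(d), table_length(d))
    print(hypercube_neighbors(5, 3))     # neighbors of node 5 when d = 3
```

Under these assumptions the table already holds 40,320 entries at d=7, and a 16x16 switch would need a far larger one, which is consistent with the move to bit-slice routing hardware beyond 128 nodes.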
Data transmission between nodes 12 occurs over
fiber-optic cables in byte-serial format at a duplex rate
of 1 Gbyte/second. This rate provides approximately
two orders-of-magnitude headroom for occasional
burst transmissions and also for future computer expansion.
Each node 12 has a 1 Mword boundary-point
write-through cache which, in the absence of host-node
requests for cache bus cycles, is continuously updated
by the hyperspace router 80. Thus, current boundary
data are maintained physically and logically close to the
ALU pipeline inputs.
While the invention has been described with reference
to the preferred embodiment thereof, it will be
appreciated that various modifications can be made to
the parts and methods that comprise the invention without
departing from the spirit and scope thereof.
We claim: 
1. A multi-node, parallel processing computer apparatus comprising:
a plurality of nodes each including an internal memory
and a reconfigurable arithmetic logic (ALU)
pipeline unit and a memory/ALU/switch network
(MASNET) for transferring data from said internal
memory through said MASNET to said reconfigurable
ALU pipeline unit and from said reconfigurable
ALU pipeline unit through said MASNET to
said internal memory, said reconfigurable ALU
pipeline unit further including a first group of
programmable processors permanently connected
together in a first configuration having four (4) inputs
and one (1) output and a second group of
programmable processors permanently connected together
in a second configuration different from said first
configuration, said second group having three (3)
inputs and one (1) output, and an ALU pipeline
configuration switching network means (FLONET)
for selectively connecting said first and second
groups to each other, and sequencer means for
providing instructions to said FLONET once a
clock cycle; and,
router means for routing data between said nodes,
wherein said reconfigurable ALU pipeline unit selectively
performs different computations according
to instructions from said sequencer means once a
clock cycle.
2. A reconfigurable computer apparatus comprising:
a first group of programmable processors permanently
connected together in a first configuration
having four (4) inputs and one (1) output, said first
group including a first programmable processor
having at least two (2) inputs and at least one (1)
output; a second programmable processor having
at least two (2) inputs and at least one (1) output;
and, a third programmable processor having two
(2) inputs permanently connected to the outputs of
said first and second programmable processors,
said third programmable processor also having an
output, such that the four inputs of said first group
comprise the inputs of said first and second
programmable processors and the output of said first
group comprises the output of said third
programmable processor;
a second group of programmable processors perma- 
nently connected together in a second configura- 
tion different from said first configuration, said 
second group having three (3) inputs and one (1)
output and including a fourth programmable processor
having two (2) inputs and one (1) output;
and, a fifth programmable processor having two (2)
inputs and one (1) output, one of said inputs of said
fifth programmable processor being permanently
connected to the output of said fourth programmable
processor, such that the three (3) inputs of said
second group comprise the two (2) inputs to said
fourth programmable processor and the input to
said fifth programmable processor not connected
to the output of said fourth programmable processor,
and the output of said second group comprising
the output of said fifth programmable processor;
a third group of programmable processors comprising
individual processors having two (2) inputs and
one (1) output; 
switching means (FLONET) for selectively connect- 
ing said first, second and third groups together; 
and, 
sequencer means for providing instructions to said 
FLONET once a clock cycle, 
wherein said apparatus selectively performs different 
computations according to instructions from said 
sequencer means once a clock cycle. 
3. A reconfigurable computer apparatus including
arithmetic/logic units (ALU), said apparatus comprising:
at least a first substructure including three (3) ALU 
units permanently connected together in a first 
configuration having four (4) inputs and one (1) 
output; 
at least a second substructure including two (2) ALU 
units permanently connected together in a second 
configuration having three (3) inputs and one (1)
output; 
at least a third substructure including at least one 
individual ALU unit having two (2) inputs and
one (1) output; 
switching means for selectively connecting said first, 
second and third substructure together; and, 
sequencer means for providing instructions to said 
switching means, 
wherein said apparatus selectively performs compu- 
tations according to instructions from said se- 
quencer means. 
4. A node apparatus for use in a multi-node, parallel
processing system, said node apparatus comprising:
an internal memory including a plurality of memory
planes;
a dynamically reconfigurable arithmetic logic (ALU)
pipeline means for performing computations, including
a plurality of ALUs at least three of which
are permanently connected to each other;
an ALU pipeline configuration switching network
means (FLONET) for selectively connecting
groups of said ALUs in said dynamically reconfigurable
arithmetic logic pipeline means together;
a memory/ALU/switch network (MASNET) for
transferring data from the memory planes of said 
internal memory through said MASNET to said 
dynamically reconfigurable ALU pipeline means 
and from said dynamically reconfigurable ALU 
pipeline means through said MASNET to said 
internal memory; and, 
sequencer means for providing instructions to said 
FLONET, 
wherein said dynamically reconfigurable ALU pipe- 
line means selectively performs different computa- 
tions according to instructions from said sequencer 
means. 
5. The apparatus of claim 1 wherein said first group
of programmable processors comprises:
a first programmable processor having at least two
(2) inputs and at least one (1) output;
a second programmable processor having at least two
(2) inputs and at least one (1) output; and,
a third programmable processor having two (2) inputs
permanently connected to the outputs of said
first and said second programmable processors,
said third programmable processor also having an
output,
wherein the inputs to said first group comprise the
inputs of said first and second programmable processors
and the output of said first group comprises
the output of said third programmable processor.
6. The apparatus of claim 5 wherein said second
group of programmable processors comprise:
a fourth programmable processor having at least two
(2) inputs and at least one (1) output; and,
a fifth programmable processor having two (2) inputs
and one (1) output, one of said inputs of said fifth
programmable processor being permanently connected
to the output of said fourth programmable
processor,
wherein the inputs of said second group comprise the
two inputs to said fourth programmable processor
and the one input to said fifth programmable processor
not connected to the output of said fourth
programmable processor, and the output of said
second group comprises the output of said fifth
programmable processor.
7. The apparatus of claim 6 wherein said reconfigurable
ALU pipeline unit further comprises:
a third group of programmable processors comprising
individual programmable processors connected
to said FLONET for selective connection with
said first and second groups of programmable processors.
8. The apparatus of claim 7 wherein the ratio of said
first group of programmable processors with respect to
said second group of programmable processors in a
given reconfigurable ALU pipeline unit is approximately
in the range of 1.5-2.0 to 1.0.
9. The apparatus of claim 8 wherein the ratio of said
second group of programmable processors to said third
group of programmable processors is approximately 2.0
to 1.0.
10. The apparatus of claim 9 wherein said internal
memory comprises a plurality of memory planes.
11. The apparatus of claim 10 wherein each memory
plane comprises:
a main memory bank;
an address multiplexer for transmitting data to and
from said main memory bank;
a prefetch address register connected between said 
main memory bank and said address multiplexer; 
and, 
a translate table means connected to said address 
multiplexer for scanning said assembly bank in a 
random access manner. 
12. The apparatus of claim 11 wherein said sequencer
means further comprises:
microsequencer means connected to said internal
memory, MASNET and reconfigurable ALU pipeline
unit for governing the clocking of data between
said internal memory, MASNET and said
reconfigurable ALU pipeline unit.
13. The apparatus of claim 12 wherein each node
further comprises:
a microcontroller connected to said internal memory,
MASNET and said reconfigurable ALU pipeline
unit for initializing and verifying the status of said
internal memory, MASNET and reconfigurable
ALU pipeline.
14. The apparatus of claim 13 wherein said MASNET
comprises:
a plurality of register files cross connected in a Benes
switching network arrangement and pipelined so as
to make the connection of any input to any output
non-blocking.
15. The apparatus of claim 14 further comprising:
boundary-point definition (BPD) cache means connected
between said router means and said MASNET
for routing BPD data to specific destination
nodes,
wherein said apparatus supports both global addressing
and explicit BPD addressing modes.
16. The apparatus of claim 15 further comprising:
a front end computer for feeding data and instructions
to said nodes; and,
off-line mass storage means connectable to said front
end computer.
17. The apparatus of claim 16 wherein said nodes are
connected together in the topology of a boolean hypercube
and vary in number in the range of from 1 to 128.
18. The apparatus of claim 2 further comprising:
an internal memory; and,
a memory-ALU switch network means (MASNET)
for transferring data from said internal memory
through said MASNET to said switching means
and for transferring data from said switching means
through said MASNET to said internal memory.
19. The apparatus of claim 18 wherein said sequencer
means further comprises:
microsequencer means connected to said internal
memory, MASNET and switching means for governing
the clocking of data between said internal
memory, MASNET and switching means.
20. The apparatus of claim 19 further comprising:
microcontroller means connected to said internal
memory, MASNET and switching means for initializing
and verifying the status of said internal
memory, MASNET and switching means.
21. The apparatus of claim 2 wherein at least some of
said processors comprise floating point arithmetic processors.
22. The apparatus of claim 2 wherein at least some of
said processors comprise integer arithmetic logic processors.
23. The apparatus of claim 2 wherein the ratio of said
first group of programmable processors with respect to
said second group of programmable processors is approximately
in the range of 1.5-2.0 to 1.0.
24. The apparatus of claim 2 wherein the ratio of said
second group of programmable processors to said third
group of programmable processors is approximately 2.0
to 1.0.
* * * * *
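The fixed wiring of the three processor groups recited in claims 2 and 3 can be pictured with a minimal sketch. It is an illustration under stated assumptions, not the patented hardware: each programmable processor is modeled as a two-input Python function, the names `singlet`, `doublet` and `triplet` are hypothetical labels for the third, second and first groups, and the FLONET and sequencer are not modeled beyond composing group outputs by hand.

```python
import operator
from typing import Callable

# A programmable processor is modeled as a two-input, one-output function.
Op = Callable[[float, float], float]


def singlet(op: Op) -> Callable[[float, float], float]:
    """Third group: one individual processor, two inputs, one output."""
    return lambda a, b: op(a, b)


def doublet(op4: Op, op5: Op) -> Callable[[float, float, float], float]:
    """Second group: the fourth processor's output is wired to one input
    of the fifth, leaving three external inputs and one output."""
    return lambda a, b, c: op5(op4(a, b), c)


def triplet(op1: Op, op2: Op, op3: Op) -> Callable[[float, float, float, float], float]:
    """First group: the first and second processors feed the third,
    leaving four external inputs and one output."""
    return lambda a, b, c, d: op3(op1(a, b), op2(c, d))


if __name__ == "__main__":
    # (a*b) + (c*d) on a triplet, then (x + y) * z on a doublet.
    dot2 = triplet(operator.mul, operator.mul, operator.add)
    scale = doublet(operator.add, operator.mul)
    partial = dot2(1.0, 2.0, 3.0, 4.0)
    print(partial)
    print(scale(partial, 1.0, 0.5))
```

Changing the functions passed to each group stands in for reprogramming the processors, while routing one group's output into another group's input stands in for a FLONET connection.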
