The Coming Age of Parallel-Processing Supercomputer by Nosenchuck, Daniel M. & Littman, Michael G.
The Space Congress® Proceedings 1986 (23rd) Developing Space For Tomorrow's Society 
Apr 1st, 8:00 AM 
The Coming Age of Parallel-Processing Supercomputer 
Daniel M. Nosenchuck 
Department of Mechanical and Aerospace Engineering Princeton University Princeton, NJ 08544 
Michael G. Littman 
Department of Mechanical and Aerospace Engineering Princeton University Princeton, NJ 08544 
Scholarly Commons Citation 
Nosenchuck, Daniel M. and Littman, Michael G., "The Coming Age of Parallel-Processing Supercomputer" 
(1986). The Space Congress® Proceedings. 5. 
https://commons.erau.edu/space-congress-proceedings/proceedings-1986-23rd/session-2/5 
THE COMING AGE OF THE 
PARALLEL-PROCESSING SUPERCOMPUTER
Daniel M. Nosenchuck 
Department of Mechanical and 
Aerospace Engineering 
Princeton University 
Princeton, NJ 08544
and
Michael G. Littman 
Department of Mechanical and 
Aerospace Engineering 
Princeton University 
Princeton, NJ 08544
ABSTRACT
It is anticipated that the needs of scientific computation will dramatically
outpace the performance of general-purpose supercomputers over the next decade.
These needs will, however, be addressed by an emerging class of
parallel-processing supercomputers (PPS). The Princeton University
Navier-Stokes Computer (NSC) is a PPS geared toward simulating complex flows.
It has a projected speed and capacity two orders of magnitude beyond those of
current supercomputers. The architecture of the NSC is presented, and a working
prototype is discussed.
INTRODUCTION
Most problems of interest in aeronautics and astronautics are quite complex and
do not yield to known analytical techniques. "Solutions" are often obtained
after conducting many experimental tests in the laboratory. However, due to the
limitations of modeling, the results of these tests are approximate. In
addition, there are wide classes of problems involving dynamically changing
conditions which cannot adequately be simulated experimentally. Therefore, one
must resort to numerical simulations on a digital computer in an attempt to
predict the behavior of a system under study or design.
Candidates for numerical treatment include most problems in fluid mechanics,
among them flow over complex bodies, turbulence, separation, drag, and overall
dynamic flow behavior. In the related area of combustion, problems dealing with
advanced compressor and fuel delivery designs, along with the details of the
combustion process itself, cannot be readily dealt with experimentally. Other
fields, including materials and structural design, along with complex mission
forecasting and analysis, are also falling into the realm of advanced computer
simulation.
Current Architectures
The complex nature of the aforementioned problems requires the use of very fast
computers with large memories. Currently such machines are referred to as
Class VI supercomputers. These computers are typified by such units as the
Cray 2, Cyber 205, and ETA 10. While these machines represent the state of the
art in commercially available general-purpose computers, and are geared toward
large-scale scientific calculations, their performance is severely limited by
their design. They typically have a small number of CPUs processing strings of
numbers (vectors) at high speed. The individual processors cannot efficiently
communicate with one another during a computation, which limits the number of
CPUs that can be incorporated into the computer. The machine speed is primarily
limited by the speed of the individual electronic components. Current
integrated circuit designs are no longer capable of providing the large
percentage gains over previous chip generations to which the industry has been
accustomed. This is due to the constraints imposed by solid-state physics.
To overcome the constraints on the ultimate speed of individual processing
chips, a new class of computers is emerging. These machines rely on the
parallel operation of many computational and memory units to provide dramatic
increases in global speed and capacity. Initial designs of parallel computers
called for a very large number of rather small, slow individual units to be
interconnected. The resulting machine was interconnect intensive, which raised
the cost and lowered the efficiency of all phases associated with the design
and implementation of the machine.
Parallel-Processing Supercomputers
Since the needs of scientific computation will dramatically outpace the
performance of general-purpose supercomputers over the next decade, they must
be addressed by some form of parallel computer. It is anticipated that a class
of parallel-processing supercomputers (PPS) will emerge to meet the needs of
computationally demanding problems. These machines have an architecture which
combines a number of individually powerful computers together in a fashion
which permits rapid, efficient interprocessor communication to occur during the
course of a computation. The PPS's may be roughly grouped into three
categories: general-purpose (GP), multi-purpose (MP), and single-purpose (SP)
computers. Although the overall performance increases from the GP to the SP
machines, the range of potential applications for the machine diminishes.
THE NAVIER-STOKES COMPUTER (NSC)
A current design which typifies the architecture of a PPS is the Navier-Stokes
Computer (NSC) at Princeton University. The machine was originally designed for
the numerical simulation of a wide variety of complex flows, governed by the
general field equations of continuum fluid mechanics, called the Navier-Stokes
equations. The current machine architecture permits the efficient, rapid
numerical solution of a wide range of scientific problems, and includes the
capability to display the results in real time using appropriate graphics.
Thus the NSC is classified as an MP/PPS.
The NSC embodies both a new architecture and current device technology.
Although the initial thrust of the machine is in the area of fluid dynamics,
the NSC does not adhere to any specific class of numerical algorithms. Rather,
the computer is designed to efficiently solve almost any problem that meets
three criteria:
- the computation is numerically intensive;
- the algorithm uses long vectors (vector length < 10^9);
- the algorithm is quasi-local (may be parallelized with a coarse granularity).
Global NSC Topology
The gross architecture of the machine involves the interconnection of a rather
small number (up to 128) of processor-memory units, called Nodes. Each Node has
the power and speed of a current Class VI (e.g. Cray 2) supercomputer. The
global topology of the internode interconnects is that of an augmented
hypercube network. The hypercube was chosen initially because it provides for
efficient transfer of data among the Nodes for a wide range of scientific
computations. An illustration of a 64-element hypercube is given in Figure 1.
The interconnect network is augmented with local "hyperspace routing" switches
which effectively remove all data routing overhead from the Node. This permits
the Node to concentrate on computation most (>99%) of the time. Although the
worst-case network delay in an array of 128 elements is seven transfer periods,
most problems are solved with efficient numerical schemes that are "local". In
this case the interprocessor communication occurs between directly connected
neighboring Nodes. The communication network is also augmented by a drop-line
bus which permits the front-end computer to transfer data and commands directly
to the Nodes.
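The seven-period worst case follows from the hypercube addressing: two Nodes are directly linked when their binary addresses differ in exactly one bit, so the minimum hop count is the number of differing bits. The sketch below illustrates this in Python. The function names and the lowest-bit-first routing order are assumptions made for illustration; the paper does not describe the algorithm used by the hyperspace router.

```python
def neighbors(node: int, dim: int) -> list[int]:
    """Nodes directly linked to `node` in a dim-dimensional hypercube."""
    return [node ^ (1 << bit) for bit in range(dim)]


def hops(src: int, dst: int) -> int:
    """Minimum number of link traversals: the Hamming distance of the addresses."""
    return bin(src ^ dst).count("1")


def route(src: int, dst: int, dim: int) -> list[int]:
    """One shortest path, correcting differing address bits from lowest to highest.

    This routing order is an illustrative assumption, not the NSC router's method.
    """
    path = [src]
    cur = src
    for bit in range(dim):
        if (cur ^ dst) & (1 << bit):
            cur ^= 1 << bit
            path.append(cur)
    return path


DIM = 7  # 2**7 = 128 Nodes
# By symmetry of the hypercube, the worst case from Node 0 is the worst case overall.
worst_case = max(hops(0, dst) for dst in range(2 ** DIM))
```

For a 128-Node array, `worst_case` evaluates to 7, matching the figure quoted above, while any pair of directly connected neighbors is one transfer period apart.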
The entire array of NSC Nodes is hosted by a general-purpose front-end computer
that provides the operating environment. A simplified drawing of the front-end
relative to a portion of the NSC, with a subset of the internode links
illustrated, is shown in Figure 2. Program development, workstation support,
and off-line data analysis capabilities are supported by the front-end, along
with more immediate duties such as loading, dumping, checkpointing, and
monitoring the NSC Node array. Finally, the front-end provides a large amount
of mass storage, in the form of a disc farm. This is in addition to the
distributed disc storage associated with each Node.
NSC Nodes
The Node derives its computational speed and efficiency from an architecture
which provides for many vectors to simultaneously stream from multiple memory
planes into a processing arithmetic logic unit (ALU). The ALU is composed of 24
high-speed processing elements, which may be interconnected to form a single
processing pipeline in which the results of an intermediate computation are
immediately used as operands for a subsequent computation. High-level, often
final, results are returned to memory. A pipeline which processes the
expression
$$ y_i = B_{ij}\, x_j, \qquad n = \text{vector length} $$
is shown in Figure 3. Eight operands enter the top of the pipeline, and a
single result is returned to memory. Thus the overhead, storage and complexity
associated with maintaining the intermediate results of a computation are
mostly eliminated. The ALU pipeline is reconfigurable, which eliminates many of
the adverse constraints associated with building a machine optimized for a
particular algorithm.
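The chaining idea can be sketched in Python with lazy streams, where each processing element consumes two operand streams and feeds its result directly to the next element. The sketch assumes a four-term sum (four products combined by three additions), which is one arrangement consistent with eight operands and one result; the paper's exact pipeline is the one in Figure 3. The names `stage` and `pipeline` are hypothetical.

```python
from operator import add, mul


def stage(op, left, right):
    """A processing element: consumes two operand streams, emits one result stream."""
    return map(op, left, right)


def pipeline(b_cols, x_cols):
    """Chain 4 multipliers and 3 adders (an assumed configuration).

    b_cols and x_cols each hold four operand streams of equal length.
    No intermediate vector is ever stored; results flow stage to stage.
    """
    products = [stage(mul, b, x) for b, x in zip(b_cols, x_cols)]
    sum_01 = stage(add, products[0], products[1])
    sum_23 = stage(add, products[2], products[3])
    return stage(add, sum_01, sum_23)


# Eight operand streams of vector length n = 3 (illustrative values).
b_cols = [[1.0, 0.5, 2.0], [2.0, 0.5, 0.0], [3.0, 0.5, 0.0], [4.0, 0.5, 1.0]]
x_cols = [[1.0, 2.0, 3.0], [1.0, 4.0, 5.0], [1.0, 6.0, 7.0], [1.0, 8.0, 9.0]]
y = list(pipeline(b_cols, x_cols))
```

Only the final stream `y` is materialized, which mirrors the point made above: intermediate results pass between processing elements instead of being written back to memory.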
Data are routed from the multiple memory planes, through a switch, to the ALU
pipeline, with the pipeline results being returned to memory through the switch
(Figure 4). The Memory-ALU-Switch (MASNET) is architecturally central to the
Node. It is a fully reconfigurable nonblocking switch, which is used to:
- route memory outputs (operands) to ALU pipeline inputs;
- route ALU pipeline results to memory inputs and/or pipeline inputs (vector
recursion);
- redistribute memory contents among multiple planes;
- extract and insert boundary data, in conjunction with the hyperspace router,
to provide the means for internode communications without disrupting local
computations.
Another key use of the switch is the generation of multiple vectors from a
single source. Each generated vector can deviate in local ordering from the
original vector. This feature benefits most scientific computations by
liberating large portions of memory from having to hold nearly duplicate copies
of data, as well as decreasing the required memory bandwidth.
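A software analogy may help: stencil-type computations need the same data at several offsets, and a switch that fans one stored vector out into several shifted streams avoids keeping a copy per offset. The Python sketch below models this with lazy iterators over a single list. The function `fan_out` and its parameters are hypothetical, and the model ignores all hardware timing.

```python
from itertools import islice


def fan_out(source, offsets, length):
    """Return one lazy stream per offset; every stream reads the single stored `source`."""
    return [islice(source, off, off + length) for off in offsets]


phi = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0]  # one stored vector (illustrative values)
left, center, right = fan_out(phi, offsets=(0, 1, 2), length=4)

# Streams shifted by two elements pair each point's left and right neighbors.
neighbor_sum = [l + r for l, r in zip(left, right)]
center_values = list(center)
```

Here `neighbor_sum[k]` is the sum of the two neighbors of `center_values[k]`, obtained without storing shifted duplicates of `phi`.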
Device Technology
Major integrated-circuit technologies fall into several categories, a number of
which are considered here. Complementary metal-oxide-semiconductor (CMOS)
technology is a popular choice among designers because of its inherent density
and low power. Bipolar chips offer intermediate densities with very high speeds
at the expense of high power. TTL and N-channel MOS (NMOS) technologies, while
each having some individual advantages, no longer occupy the majority of the
circuit-board area in high-end computers due to power/speed/density
limitations. Emitter-coupled logic (ECL) is among the fastest production
circuitry, although the densities are rather low and the power consumption is
high. Gallium arsenide (GaAs) circuitry provides the potential for extremely
high speeds at cryogenic temperatures in addition to functioning well at room
temperature. Due to limited experience in design and manufacture, with the
attendant high cost of GaAs wafers, its present use is limited to several
experimental computers.
The modular design of the NSC provides the capability to utilize a given device
technology where appropriate. Thus, the NSC uses a combination of CMOS,
bipolar, TTL and ECL circuitry. CMOS is used primarily in the memory planes,
while bipolar parts reside in the high-speed computational units.
The internode communication links are implemented with fiber-optic transmission
lines. A duplex data channel between Nodes sends and receives data in a
byte-serial format. These communication links are noise-immune and robust, and
provide transmission rates as high as 1 billion bytes/sec between adjacent
Nodes.
Performance
Each Node is designed with 16 32-bit memory planes (128 Mbytes/plane), for a
local memory of 2 billion bytes (2 Gbytes). The Node has 24 floating-point
units available for computation at 20 million floating-point operations per
second (Mflops) each, giving a peak speed of 480 Mflops (32-bit precision).
Eight 32-bit logical units are also available for the pipeline. For a 128-Node
NSC, the sustained performance is projected to be in excess of 50 Gflops, with
an installed memory of 256 Gbytes.
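The quoted figures can be cross-checked with a few lines of arithmetic. The Python sketch below uses only numbers stated above; the unit conversions (1024 Mbytes per Gbyte, 1000 Mflops per Gflops) and the variable names are assumptions.

```python
PLANES_PER_NODE = 16
MBYTES_PER_PLANE = 128
UNITS_PER_NODE = 24
MFLOPS_PER_UNIT = 20
NODES = 128
PROJECTED_SUSTAINED_GFLOPS = 50  # lower bound quoted in the text

node_memory_mbytes = PLANES_PER_NODE * MBYTES_PER_PLANE        # per-Node memory
node_peak_mflops = UNITS_PER_NODE * MFLOPS_PER_UNIT            # per-Node peak speed
system_memory_gbytes = NODES * node_memory_mbytes / 1024       # assumes 1024 Mbytes/Gbyte
system_peak_gflops = NODES * node_peak_mflops / 1000           # assumes 1000 Mflops/Gflops
sustained_fraction = PROJECTED_SUSTAINED_GFLOPS / system_peak_gflops
```

Under these assumptions the full array has a peak of 61.44 Gflops, so the projected sustained rate of 50 Gflops corresponds to roughly 81% of peak.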
THE NSC MICRONODE
The first effort in the realization of the NSC architecture is a two-board set
that is referred to as the NSC MicroNode. One board is the Memory-ALU-Switch
(MALUS) unit, which is responsible for the pipelined data processing, memory
management and storage. The second board is the MicroNode manager, which
configures the MALUS unit and provides connection to the front-end computer.
The most notable aspect of the NSC MicroNode is that it has demonstrated all of
the innovative features of the NSC architecture. The MicroNode incorporates
three Advanced Micro Devices (AMD) floating-point processors (32-bit, 20 Mflops
each) into a reconfigurable pipeline. The pipeline has demonstrated the
capacity for processing data at a peak speed of 60 Mflops. Also, a switch,
based on Weitek register files, routes data between memory planes and the
reconfigurable ALU pipeline. Four multiported memory planes are provided, each
composed of a small amount of high-speed memory. Data circulation within MALUS
is supervised by a microcoded control unit.
An Example
The organization of the NSC MicroNode is best illustrated by considering the
numerical simulation of 2-d potential flows. This simulation is essentially
reducible to a solution of the Laplace equation,
$$ \nabla^2 \phi = 0, $$
with appropriate boundary conditions. Although this equation may be solved
using a wide variety of efficient techniques, for purposes of illustration the
point-Jacobi relaxation algorithm is considered. The implementation of this
example involves four unique MicroNode configurations, one of which is shown in
Figure 5. The general MicroNode interconnections are presented in Figure 6.
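For reference, the point-Jacobi update can be written out explicitly. The form below assumes a uniform square grid and the standard five-point discretization of the Laplacian; the paper does not state its discretization, so the grid indices $i, j$ and the iteration counter $k$ are notation introduced here.

```latex
\phi^{(k+1)}_{i,j} = \tfrac{1}{4}\left(
    \phi^{(k)}_{i+1,j} + \phi^{(k)}_{i-1,j}
  + \phi^{(k)}_{i,j+1} + \phi^{(k)}_{i,j-1} \right)
```

Each new value depends only on the four nearest neighbors from the previous iteration, which is why the algorithm maps naturally onto streams of shifted vectors.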
The switch, which is a small version of the MASNET, allows for routing of data
from memory planes to various inputs of the pipeline. Among other features, it
can serve for the generation of multiple data streams from a single source.
Each generated data stream can deviate in local ordering from the original
sourced stream. In our example, the switch is used to generate two data streams
shifted with respect to one another by two elements. This operation facilitates
summing nearest-neighbor grid points as called for by the algorithm.
Each of the four memory planes on the MicroNode has complete address
translation capabilities to provide full "scatter-gather" capability. The
address translation information is stored in high-speed look-up tables. In this
example, this capability is used to provide a sweep of the field by rows or by
columns, as well as serving to preserve boundary values and to compensate for
pipeline delays.
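The role of the look-up tables can be illustrated with a small Python model in which a table of addresses determines the order in which a memory plane is read or written. The row-major storage layout and all function names are assumptions; the paper does not specify the table format.

```python
def row_sweep_table(rows, cols):
    """Address table that visits a row-major field row by row."""
    return [r * cols + c for r in range(rows) for c in range(cols)]


def column_sweep_table(rows, cols):
    """Address table that visits the same row-major field column by column."""
    return [r * cols + c for c in range(cols) for r in range(rows)]


def gather(memory, table):
    """Read memory in the order given by the look-up table."""
    return [memory[addr] for addr in table]


def scatter(memory, table, values):
    """Write a stream of values to the addresses given by the look-up table."""
    for addr, value in zip(table, values):
        memory[addr] = value


field = [0, 1, 2, 3, 4, 5]  # a 2 x 3 field stored row by row
by_rows = gather(field, row_sweep_table(2, 3))
by_columns = gather(field, column_sweep_table(2, 3))

restored = [None] * 6
scatter(restored, column_sweep_table(2, 3), by_columns)
```

Swapping the table changes the sweep direction without moving any data, and scattering through the same table restores the original layout.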
Finally, the MALUS unit is supervised by a microsequencer which allows for the
automatic dynamic reconfiguration of the pipeline. In the sample problem, the
microsequencer reconfigures the MicroNode for each of the four computational
steps. After completion of the four steps, the solution has been advanced by
two iterations. The solution, after these four steps, is stored in the same
memory plane with the same ordering as the initial guess. Thus, it is possible
to have the microsequencer automatically repeat the four-step operation until
the solution converges, without the need for any intervening processing.
The convergence check is made concurrently with the computation using
special-purpose vector hardware. This is an example which illustrates how the
NSC typically eliminates the usual scalar hardware associated with conditional
operations. It is the scalar unit of conventional supercomputers which often
greatly limits their performance.
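The overall iteration can be sketched in software. The Python code below is a plain scalar model, not a description of the MicroNode's four-step microcode: it performs point-Jacobi sweeps on a small grid and accumulates the largest change during the same sweep, as a loose analogy to the concurrent convergence check. The grid size, boundary values, tolerance, and function names are all assumptions.

```python
def jacobi_sweep(phi):
    """One point-Jacobi sweep; returns the new field and the largest change."""
    rows, cols = len(phi), len(phi[0])
    new = [row[:] for row in phi]  # copying keeps the boundary values unchanged
    max_change = 0.0
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = 0.25 * (phi[i - 1][j] + phi[i + 1][j]
                                + phi[i][j - 1] + phi[i][j + 1])
            # The convergence measure is accumulated in the same pass as the update.
            max_change = max(max_change, abs(new[i][j] - phi[i][j]))
    return new, max_change


def solve(phi, tol=1e-6, max_sweeps=10_000):
    """Repeat sweeps until the largest change falls below `tol`."""
    for sweep in range(1, max_sweeps + 1):
        phi, change = jacobi_sweep(phi)
        if change < tol:
            return phi, sweep
    return phi, max_sweeps


# 5 x 5 grid: potential 1 on the top boundary, 0 on the other boundaries.
initial = [[1.0] * 5] + [[0.0] * 5 for _ in range(4)]
result, sweeps = solve(initial)
```

By symmetry the converged value at the center of this grid is 0.25, which provides a simple check on the sketch.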
CONCLUSIONS
It has become obvious that most real problems in aeronautics and astronautics
require a computing capability far beyond what is supplied by current and
anticipated general-purpose supercomputers. It is also evident that the needs
of the community can be met by MP or SP parallel-processing supercomputers.
These machines embody an architecture that fundamentally relies on a multitude
of high-performance individual units interconnected in a fashion which permits
efficient concurrent operation. The Navier-Stokes Computer represents a new
architecture that satisfies the needs of the aeronautics and astronautics
community, among others.
FIGURE 1 : 6-DIMENSIONAL HYPERCUBE 
NODE ADDRESS = (ddd
GENERAL MULTINODE NAVIER-STOKES COMPUTER (labels: 128 Mword, 480 Mflop)
Figure 2
THE RECONFIGURABLE ALU PIPELINE (labels: OPERANDS, RESULTS)
Figure 3
FIGURE 4: MEMORY-ALU-SWITCH LAYOUT (labels: 16 memory planes, 128 Mbytes/plane, 2,048 Mbytes/Node; memory-ALU switch, 16 x 16 (x 32 bits), nonblocking, with internal memory; reconfigurable ALU pipeline, 480 Mflops)
FIGURE 5: SPECIFIC MALUS CONFIGURATION (labels: memory planes 1 to 4; vector switch; vector processing units 1 to 3, each an AMD 29325 at 20 Mflops)
FIGURE 6: MALUS LAYOUT (labels: memory planes 1 to 4; buffer 1; vector switch; vector processing units 1 to 3, each an AMD 29325 at 20 Mflops)
KNOWLEDGE BASED PROGRAMMING AT KSC
James H. Tulley, Jr.
Senior Software Engineer
IPS Engineering and Software Production
Lockheed Space Operations Company
John F. Kennedy Space Center, Florida

Carl I. Delaune
Project Engineer
Space Station Project Office
NASA, John F. Kennedy Space Center, Florida
ABSTRACT
In the last five years the knowledge based programming effort at KSC has grown from a few small 
technology studies to a viable applied research program. Our experience from this research has 
taught us to appreciate the potential of the discipline. Recent spinoff projects are adding to 
our understanding and yielding useful products. Our results indicate that knowledge based 
programming is a powerful tool which can profitably be applied in many engineering problems.
INTRODUCTION
The progress in knowledge based programming at KSC reflects that experienced by the rest of 
industry. In a few years the emphasis has shifted from research in a poorly understood field (at least it was poorly understood by us!) to development of a set of tools with applications in 
several different areas. In 1981, when we first started working with knowledge based systems, 
they were confined almost exclusively to computer science labs in a few universities. Only a 
few had been used for real applications, and they had not yet been popularized in magazines and 
on television. In 1986 knowledge based programming is recognized as a useful technique for 
capturing expertise in non-algorithmic domains and for rapid construction of prototype systems. 
KSC has several knowledge based programming projects in various stages of completion, and we 
have achieved encouraging results in nearly all of these projects.
DEFINITIONS
The terms artificial intelligence, knowledge based systems, and expert systems, are often used 
interchangeably. Although closely related, they have different meanings. The definitions we 
use (which are commonly but not universally accepted) are as follows:
Artificial Intelligence (AI) - A discipline of computer science, AI is the study of ideas that 
enable computers to perform intelligently. This definition is a large umbrella that attracts 
many types of research and includes many topics. AI includes such diverse problems as pattern 
matching, techniques of search, vision, natural language understanding, and knowledge based 
systems. Its practitioners include computer scientists, psychologists, linguists, and many 
others. The glue that binds these people together is the difficulty of the problems that 
they pursue. Problems that do not appear soluble by traditional programming methods are often 
bundled with the other "hard" or mysterious AI problems. An ironic side effect is that AI 
workers tend to lose the fruits of their labor. If AI research leads to better understanding 
or the solution of a problem, the problem loses its mystery and hence its claim to be AI.
Knowledge based programming has at its core the idea that the domain specific knowledge in a 
computer program can be kept separate from its control structure. This idea can be implemented 
in several ways, of which rule based systems are probably the most familiar. In rule based 
systems knowledge is expressed in rules or predicate calculus assertions; another part of the 
system controls the execution of these rules. The archetypical rule based system is MYCIN
