A multiarchitecture parallel-processing development environment by Townsend, Scott et al.
NASA Technical Memorandum 106180
A Multiarchitecture Parallel-Processing
Development Environment
Scott Townsend
Sverdrup Technology, Inc.
Lewis Research Center Group
Brook Park, Ohio
and
Richard Blech and Gary Cole
National Aeronautics and Space Administration
Lewis Research Center
Cleveland, Ohio
Prepared for the
Seventh International Parallel Processing Symposium
sponsored by the Institute of Electrical and Electronics
Engineers Computer Society
Newport Beach, California, April 13-16, 1993
https://ntrs.nasa.gov/search.jsp?R=19930019439 2020-03-17T04:41:36+00:00Z
A Multiarchitecture Parallel-Processing
Development Environment
Scott Townsend*
Sverdrup Technology, Inc.
Lewis Research Center Group
Brook Park, Ohio 44142
Richard Blech and Gary Cole
National Aeronautics and Space Administration
Lewis Research Center
Cleveland, Ohio 44135
Abstract
A description is given of the hardware and software
of a multiprocessor test bed: the second generation
Hypercluster system. The Hypercluster architecture
consists of a standard hypercube distributed-memory
topology, with multiprocessor shared-memory
nodes. By using standard, off-the-shelf hardware, the
system can be upgraded to use rapidly improving com-
puter technology. The Hypercluster's multiarchitec-
ture nature makes it suitable for researching parallel
algorithms in computational field simulation applica-
tions (e.g., computational fluid dynamics). The dedi-
cated test-bed environment of the Hypercluster and its
custom-built software allow experiments with various
parallel-processing concepts such as message passing
algorithms, debugging tools, and computational "steer-
ing". Such research would be difficult, if not impossi-
ble, to achieve on shared, commercial systems.
Keywords — parallel processing, system software,
computer architecture, computational fluid mechan-
ics.
1 Introduction
For over ten years, the NASA Lewis Research Cen-
ter has been developing test-bed systems for research-
ing the hardware and software aspects of parallel pro-
cessing. Early efforts [1-6] focused on using multiple
microprocessors to achieve low-cost, real-time simulation
of airbreathing propulsion systems described by
systems of ordinary differential equations. Since then,
attention has turned to computational field simulation,
involving the solution of systems of partial differential
equations (e.g., computational fluid dynamics
(CFD)).
Because field simulation can involve multiple lev-
els of parallelism, the Hypercluster was conceived as
a multiarchitecture test bed [7] to research parallel
processing issues concerning CFD algorithms and ap-
plications, as well as system software tools and util-
ities. The system consists of a front-end processor
connected to a network of nodes which are arranged
in a hypercube distributed-memory topology. Each
node can have multiple processors connected through
shared memory, and multiple communication links to
*This work was supported by the NASA Lewis Research Center
under contract NAS3-25266 with Gary Cole as monitor.
Figure 1: System topology
other nodes in the system. Applications are written
with explicit message passing and/or shared memory
constructs. The parallel processing library developed
to support these constructs on the Hypercluster is de-
scribed in [8]. A message passing kernel and an initial
operating capability for the original Hypercluster sys-
tem are described in [9] and [10], respectively.
This report describes the current, second genera-
tion Hypercluster. The system hardware is described
first, followed by a discussion of the system software
and software development of applications. Finally, a
brief overview of application and system software re-
search conducted on the Hypercluster is given.
2 System hardware
The system is built exclusively from off-the-shelf
hardware, described below.
2.1 Front-end processor (FEP)
At the highest level, the hardware consists of a
Front-End Processor (FEP) connected to a network
of multiprocessor nodes (see Figure 1).
The FEP is a Silicon Graphics Personal Iris which
runs a version of UNIX. It communicates with the network
of nodes via an intermediary Backplane Service
Processor (BSP). Communication between the FEP
and BSP occurs over a VME to VME bus repeater.
Communication between the BSP and the network of
nodes occurs over a dual-port memory communication
link identical to the links (described below) used be-
tween nodes.
2.2 Network processors
There are a total of 40 processors in the network,
of which 32 are used for application processing. The
Figure 2: Node architecture (CP, QP, and communication links on the VME bus; QP with 128MB RAM. P: 88100 processor; C: 88200 cache/memory management unit)
network consists of eight nodes arranged in a hyper-
cube topology. Each node contains a Communications
Processor (CP), a Quad Processor (QP), and three
communication links (see Figure 2). Node zero has an
extra communication link for communicating with the
BSP. Physically, each node is in a separate VME card
cage. The CP and QP within a node communicate via
shared memory on the VME bus.
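
The topology and processor counts above can be checked with a short sketch. It assumes the eight nodes carry the usual binary hypercube labels 0 to 7, so that two nodes are linked exactly when their labels differ in one bit; the text states only that the nodes form a hypercube and that a node zero exists. The function and constant names are illustrative.

```python
def hypercube_neighbors(node: int, dimensions: int = 3) -> list[int]:
    """Return the nodes directly linked to `node` in a binary hypercube."""
    if not 0 <= node < (1 << dimensions):
        raise ValueError("node number out of range")
    # Flipping one bit of the label gives the neighbor along that dimension.
    return [node ^ (1 << d) for d in range(dimensions)]


NODES = 8          # eight nodes, from the text
CP_PER_NODE = 1    # one Communications Processor per node
QP_PER_NODE = 4    # four 88000 processors in each Quad Processor

total_processors = NODES * (CP_PER_NODE + QP_PER_NODE)
application_processors = NODES * QP_PER_NODE

if __name__ == "__main__":
    for n in range(NODES):
        print(n, hypercube_neighbors(n))
    print(total_processors, application_processors)
```

Each node has three neighbors, matching the three communication links per node, and the counts reproduce the 40 network processors and 32 application processors quoted above.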
The CP handles all communications between nodes
independently of the applications running on the QP.
It can also be used for application processing, though
this will detract from overall performance if the appli-
cation is communication-bound. The CP is a Motorola
88000 processor (described below) with 8MB of mem-
ory. The QP is used for application processing only.
It consists of four 88000 processors sharing a total of
128MB of cache-coherent memory. This memory is
normally accessed from a private bus, so there is little
interaction with VME traffic resulting from node-to-
node communications.
Each Motorola 88000 processor used in the CP
and QP modules is a RISC (Reduced Instruction Set
Computer) consisting of one 88100 microprocessor and
two 88200 cache/memory management chips. The
88100 contains 32 32-bit registers, five independent
execution units (instruction fetch, data load/store, in-
teger operations, floating-point add operations, and
floating-point multiply operations), and is heavily
pipelined. Each 88200 chip contains 16KB of cache
and the logic required for demand-paged virtual mem-
ory. There are two 88200s per 88100 to support simul-
taneous instruction and data transactions.
2.3 Communication links
The communication links between nodes are Bit3
Corporation VME to VME adaptors with Direct
Memory Access (DMA) capability. Each pair of
boards (one in each node's card cage) has a cable and
128KB of dual-port memory between them. Messages
are passed between nodes as the source CP copies a
packet to the dual-port memory and then sends an in-
terrupt to the destination node. The destination CP
will then copy the message out of the dual-port mem-
ory and process it. Due to hardware limitations in the
links, the DMA circuitry can only be used for one of
the packet copy operations.
The link dual-port memory is evenly split between
Figure 3: Client-server processes (diagram of the Server, the Agent, Client #1, and Client #2)
the two connected nodes, which use their assigned
half for outbound packets and receive packets from
the other node's half. VME bus contention is reduced
on the receiving node by using the dual-port memory
rather than directly manipulating the receiver's VME
memory.
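
As an illustration of the split dual-port memory, the following toy model gives each side of a link its own outbound half. It is a sketch under assumptions: the one-packet-at-a-time handshake, the `pending` flag standing in for the interrupt, and all names are hypothetical, and no DMA or VME behavior is modeled.

```python
from typing import Optional

LINK_BYTES = 128 * 1024          # dual-port memory per link, from the text
HALF_BYTES = LINK_BYTES // 2     # each node owns one half for outbound packets


class DualPortLink:
    """Toy model of one link; `side` is 0 or 1 for the two attached nodes."""

    def __init__(self) -> None:
        self.memory = bytearray(LINK_BYTES)
        self.pending = [None, None]   # length of the packet waiting in each half

    def send(self, side: int, packet: bytes) -> None:
        if len(packet) > HALF_BYTES:
            raise ValueError("packet does not fit in the sender's half")
        if self.pending[side] is not None:
            raise BlockingIOError("previous packet not yet consumed")
        base = side * HALF_BYTES
        self.memory[base:base + len(packet)] = packet   # source CP copies in
        self.pending[side] = len(packet)                # stands in for the interrupt

    def receive(self, side: int) -> Optional[bytes]:
        other = 1 - side                                # read the other node's half
        length = self.pending[other]
        if length is None:
            return None
        base = other * HALF_BYTES
        packet = bytes(self.memory[base:base + length]) # destination CP copies out
        self.pending[other] = None
        return packet
```

Because a node only ever writes into its own half, the two directions of a link never contend for the same region of the dual-port memory.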
The link dual-port memory could also be used as a
(small) memory shared between application processes.
Future research may look into what advantages might
result from such usage.
3 System software
The system software is split between front-end software
which runs on the FEP and user workstations,
and a Message Passing Kernel (MPK) which runs on
the CP and QP processors. The front-end software is
written using a client-server model. The server code
must run on the FEP, while the client code can be run
from the FEP or any workstation on the Internet. Dif-
ferent client programs support different styles of user
interaction, all accessing the same server.
3.1 Server software
The server software is responsible for providing local
or remote client access, local client file I/O, system
configuration, kernel and application loading, and system
health checks. (Local clients are client processes
which run on the FEP; remote clients access
the FEP via the Internet.) The server also acts as a
gateway between the client and the application code
running on the network processors. The server accepts
requests from either a UNIX domain socket for local
clients or an Internet domain socket for remote clients.
Client login follows a three step process as shown in
Figure 3:
1. The client sends a login request to the well-known
server socket.
2. If the system is in use, the server sends a reply to
the client identifying the current user and rejects
the login. If the system is available, the server will
fork a copy of itself, known as the agent process.
This agent process will then reply to the client
with a message indicating that the login has been
accepted. Once the agent is started, the server
resumes monitoring its well-known port for login
requests from other users.
3. The client proceeds to issue requests to, and receive
replies from, the agent process.
If the client requires multiple processes, the addi-
tional processes can register with the agent process
(shown as step 4 in Figure 3). The registration scheme
allows additional client processes to be used as mes-
sage handlers.
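
The login sequence and the registration step can be summarized in a small model. This is a minimal sketch, not the Hypercluster server code: it uses plain Python objects in place of sockets and fork(), and the class and method names are assumptions made for illustration.

```python
class Server:
    """Toy model of the login handshake; no real sockets or fork() involved."""

    def __init__(self) -> None:
        self.current_user = None

    def login(self, user: str):
        # Step 2: reject if the system is in use, otherwise create an agent.
        if self.current_user is not None:
            return ("rejected", self.current_user)
        self.current_user = user
        return ("accepted", Agent(self, user))


class Agent:
    """Stands in for the forked copy of the server that serves one user."""

    def __init__(self, server: Server, user: str) -> None:
        self.server = server
        self.user = user
        self.handlers = {}   # message type -> name of the registered client process

    def request(self, command: str) -> str:
        # Step 3: requests and replies flow between the client and the agent.
        return f"reply to {command!r} for {self.user}"

    def register(self, process_name: str, message_type: str) -> None:
        # Step 4: an additional client process registers as a message handler.
        self.handlers[message_type] = process_name

    def logout(self) -> None:
        self.server.current_user = None
```

A second login attempt while the first user is active returns the rejection together with the current user's name, as in step 2.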
For local clients, the agent process will handle all
file I/O requests made by the application directly, except
for I/O to the user's terminal. For remote clients
or local client terminal I/O, the agent will forward the
I/O request on to the client. In this way all application
I/O occurs in the user's local file system even if
the user is running a client from a remote workstation.
Applications which have heavy I/O traffic will achieve
higher performance when run from a local client. With
up to 32 application processors running, it is quite pos-
sible to exceed the agent's (or remote client's) maxi-
mum number of open files. To avoid this problem, the
agent (and remote client) code employs a file descriptor
caching scheme to provide an effectively unlimited number of
open files.
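
One way such a descriptor cache could work is sketched below. The source does not describe the actual scheme, so the least-recently-used eviction policy, the saved file offsets, and every name here are assumptions. Only `limit` operating system descriptors are open at any time, however many files the application believes it has open.

```python
import os
from collections import OrderedDict


class FileDescriptorCache:
    """Keep at most `limit` real descriptors open for any number of virtual files."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.open_fds = OrderedDict()   # virtual id -> OS descriptor, oldest first
        self.files = {}                 # virtual id -> [path, saved offset]
        self.next_id = 0

    def open(self, path: str) -> int:
        vid = self.next_id
        self.next_id += 1
        self.files[vid] = [path, 0]     # nothing is opened until first use
        return vid

    def _descriptor(self, vid: int) -> int:
        if vid in self.open_fds:
            self.open_fds.move_to_end(vid)          # mark as most recently used
            return self.open_fds[vid]
        if len(self.open_fds) >= self.limit:
            old_vid, old_fd = self.open_fds.popitem(last=False)
            # Remember where the evicted file was, then release its descriptor.
            self.files[old_vid][1] = os.lseek(old_fd, 0, os.SEEK_CUR)
            os.close(old_fd)
        path, offset = self.files[vid]
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
        os.lseek(fd, offset, os.SEEK_SET)           # resume at the saved position
        self.open_fds[vid] = fd
        return fd

    def write(self, vid: int, data: bytes) -> int:
        return os.write(self._descriptor(vid), data)

    def close_all(self) -> None:
        for fd in self.open_fds.values():
            os.close(fd)
        self.open_fds.clear()
```

The cost of this approach is an extra open and seek whenever an evicted file is touched again, which is acceptable when only a few files are active at once.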
The agent allows the user to configure the kernels
to be loaded on the BSP, CPs, and QPs and perform
a system reboot. In this way new kernels can be de-
bugged by ordinary clients without locking-out other
users between runs. (At each user logout the system
is rebooted with the standard kernels in a procedure
that takes under 20 seconds.) The user may also con-
figure how many processors in the QP module should
actually be used. By changing the number of active
QP module processors, the user can alter how much
memory an application may have per processor from
32MB to 128MB. Thus the system can be arranged
in configurations which vary between 32 processors of
32MB each to 8 processors of 128MB each. This ca-
pability has been found useful in the initial porting
stages of large sequential codes.
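
The configurations quoted above follow from simple arithmetic, sketched here under the assumption that the 128MB of each QP is shared evenly by its active processors. The function name is illustrative.

```python
NODES = 8             # nodes in the network, from the text
QP_MEMORY_MB = 128    # memory shared by the active processors of one QP


def configuration(active_per_qp: int) -> tuple[int, float]:
    """Return (application processors, MB per processor) for one setting."""
    if not 1 <= active_per_qp <= 4:
        raise ValueError("a QP has between 1 and 4 active processors")
    return NODES * active_per_qp, QP_MEMORY_MB / active_per_qp


if __name__ == "__main__":
    for n in (4, 2, 1):
        processors, megabytes = configuration(n)
        print(f"{processors} processors with {megabytes:.0f}MB each")
```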
The agent periodically checks the health of the
system by sending a short message to each proces-
sor. Upon reception, the destination processor simply
echoes the message back to the agent. If the agent
does not see the echo within a timeout period, the
failed processor is noted and the system automatically
reboots. These health check messages are limited to
one per second to minimize disruption of a running
application. During kernel debugging a typical fail-
ure mode is for messaging between nodes to become
deadlocked, and these health checks will detect this
condition. While running applications it is rare for
the system to fail in this manner, though it is still
possible with a "runaway" application since the ker-
nel does not enforce flow control.
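
The health check can be pictured as a rate-limited echo loop. The sketch below is illustrative only: `echo` is a caller-supplied stand-in for sending the short message and waiting for it to return, and the timeout value is an assumption, since the source gives only the limit of one message per second.

```python
import time


def health_check(processors, echo, timeout=0.5, interval=1.0, sleep=time.sleep):
    """Probe each processor in turn; return the first one that fails, or None.

    `echo(proc, timeout)` must return True if the short message sent to
    `proc` was echoed back within `timeout` seconds.
    """
    for proc in processors:
        if not echo(proc, timeout):
            return proc        # the caller notes the failure and reboots
        sleep(interval)        # limit probes to one per `interval` seconds
    return None
```

A deadlocked messaging path shows up as a missing echo, so the same loop covers both a failed processor and the kernel-debugging failure mode described above.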
Whenever the agent receives a message from the
system which is not intended for it, it will forward this
message to the client via the UNIX domain socket or
the Internet domain socket depending upon the type of
client. Multiple process clients will have the message
forwarded to the appropriate client process depending
on the type of message and which process registered as
a handler for that type. If the agent receives a message
from the client which is destined for a processor in the
network of nodes, it is forwarded via the BSP to the
destination. This gateway function allows arbitrary
communication between (possibly remote) clients and
Figure 4: Software layers (top to bottom: Application; APPL and POSIX I/O; Interface Library; MPK; Hardware)
the application running in the system.
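
The gateway decision made by the agent might be sketched as follows. The message fields (`source`, `dest`, `type`) and the returned strings are hypothetical; the source describes the forwarding rules but not the message format.

```python
def forward(message: dict, handlers: dict, default_process: str = "main client") -> str:
    """Decide where the agent sends a message it has just received."""
    if message["source"] == "network":
        if message["dest"] == "agent":
            return "handled by agent"
        # Not for the agent: pass it to the client process registered for this type.
        process = handlers.get(message["type"], default_process)
        return f"forward to {process} over the client socket"
    # From the client, addressed to a (node, processor) pair in the network.
    node, processor = message["dest"]
    return f"forward via BSP to node {node}, processor {processor}"
```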
3.2 Client software
To support writing local or remote clients, a set of
high-level routines is provided in a library. This li-
brary provides support for accessing the server, send-
ing commands, receiving replies, handling file I/O
messages from the application, searching the applica-
tion symbol table, manipulating application variables,
and communicating with application processors. In
addition, a symbolic traceback is provided automati-
cally upon any abnormal application exit. Using this
library, simple clients such as a line-oriented debugger
take only a few pages of C code.
Existing clients include the above line-oriented de-
bugger, a menu-oriented debugger, a client for running
applications written with the Application Portable
Parallel Library (APPL, [11], discussed later), a
client which gathers system statistics and displays
the results in various charts, and a few application-
specific clients (see the "Parallel-Processing Software
Research" section).
3.3 Message passing kernel (MPK)
The MPK supports a single task with primitives
for explicit message passing, POSIX-style I/O, and
shared memory operations within a node. These fa-
cilities are accessed through a library of interface rou-
tines. Normally the application will use APPL as a
portable layer between it and the Hypercluster-specific
routines of the MPK. Figure 4 depicts the layering be-
tween the application and the MPK. The current im-
plementation is very similar in concept to that of the
previous version [9], with changes made to support
the new hardware. The previous version was written
in assembler code to optimize speed, while the current
version is written primarily in 'C', with only hardware
interface and buffer copy routines written in assem-
bly. A limited number of experiments have shown the
C compiler to be sufficiently close to the efficiency of
an assembler version to justify the convenience of a
higher level language.
The shared memory operations supported by the
MPK include barriers and semaphore operations.
Routines are also available for determining the sys-
tem configuration, the applications' node and proces-
sor number, and the system time.
Message passing primitives include buffered and
non-buffered send, broadcast, blocking and
non-blocking receive, and direct remote memory
manipulation (used for loading and debugging). Messages
are routed based upon destination node and processor
number. Messages within a node are sent taking ad-
vantage of the node's shared memory, while those des-
tined outside the node will traverse one or more com-
munication links in a simple store-and-forward fash-
ion. This process could potentially include the Inter-
net for a message intended for a remote client pro-
cess. Large messages (larger than the current max-
imum packet size of 4KB) are split into packets be-
fore being sent outside the node. Due to deterministic
routing of packets, message reassembly at the receiver
is trivial. (Forwarding across the Internet is done via
TCP, so remote client processes will still see a serial
stream of MPK packets).
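
Packet splitting and deterministic routing can be illustrated with a short sketch. The source says only that routing is deterministic; the dimension-order (e-cube) rule used here is an assumed example of such a scheme, and the function names are illustrative.

```python
MAX_PACKET = 4 * 1024   # current maximum packet size, from the text


def split_message(payload: bytes, max_packet: int = MAX_PACKET) -> list[bytes]:
    """Cut a message into packets of at most `max_packet` bytes, in order."""
    packets = [payload[i:i + max_packet] for i in range(0, len(payload), max_packet)]
    return packets or [b""]     # an empty message still travels as one packet


def route(src: int, dst: int, dimensions: int = 3) -> list[int]:
    """Dimension-order path from src to dst, listing every node visited."""
    path = [src]
    node = src
    for d in range(dimensions):
        bit = 1 << d
        if (node ^ dst) & bit:  # this address bit still differs
            node ^= bit         # cross the link along dimension d
            path.append(node)
    return path
```

Since every packet of a message follows the same path and each hop forwards packets in arrival order, the receiver can rebuild the message by concatenating packets as they arrive.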
Application messages may have a type assigned to
them by the user. Messages of the same type are held
in separate queues by the receiving processor's MPK
until the receiving application performs a receive op-
eration requesting the corresponding type. A special
value may be used to receive the next message of any
type.
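
The typed queues can be modeled in a few lines. In this sketch the wildcard constant `ANY_TYPE` is a hypothetical value (the source does not name it), and "next message of any type" is interpreted as the oldest message still waiting, which is an assumption.

```python
from collections import deque

ANY_TYPE = -1   # hypothetical wildcard value; not taken from the source


class Mailbox:
    """Per-type message queues with an optional receive-any operation."""

    def __init__(self) -> None:
        self.queues = {}     # message type -> deque of (sequence, payload)
        self.sequence = 0    # arrival counter, used to find the oldest message

    def deliver(self, msg_type: int, payload: bytes) -> None:
        self.queues.setdefault(msg_type, deque()).append((self.sequence, payload))
        self.sequence += 1

    def receive(self, msg_type: int = ANY_TYPE):
        """Non-blocking receive: return (type, payload), or None if nothing matches."""
        if msg_type == ANY_TYPE:
            waiting = [t for t, q in self.queues.items() if q]
            if not waiting:
                return None
            # Pick the type whose head message arrived first.
            msg_type = min(waiting, key=lambda t: self.queues[t][0][0])
        queue = self.queues.get(msg_type)
        if not queue:
            return None
        _, payload = queue.popleft()
        return msg_type, payload
```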
There is no flow control enforced by the MPK mes-
saging scheme. For the applications studied so far this
has not been a problem. Computations typically pro-
ceed through an iteration, exchange data, and begin
the next iteration cycle. This style of algorithm tends
to be self-pacing. Adding flow control to the MPK in
the future is under consideration.
4 Application software
Hypercluster applications are written in standard
sequential FORTRAN and/or C and are compiled using
a commercial cross-compiler running on a Sun
SPARCstation. Full runtime libraries for both lan-
guages are provided, making the porting process of
existing sequential codes to a single Hypercluster pro-
cessor straightforward.
All parallel constructs must be explicitly controlled
by the programmer; the compilers do not support any
parallel extensions or perform any automatic parallel
optimizations. An interface library is used to access
the facilities provided by the MPK. In addition, a port
of APPL is available and is the preferred mechanism
for using the parallel processing capabilities of the sys-
tem. Codes using APPL are portable to a wide range
of machines besides the Hypercluster.
5 Parallel-processing software research
The Hypercluster has been used to support a num-
ber of software research activities. A few examples of
these are briefly described in the following sections.
5.1 Applications
Due to the difficulty of obtaining reliable access to
a commercial parallel processor system at Lewis, the
Hypercluster was used to develop and debug a parallel
version of a large, 3-D CFD turbomachinery code
referred to as ISTAGE [12]. The parallel version was
programmed using the APPL, and was subsequently
ported to commercial MIMD machines without mod-
ification. Figure 5 displays the speedup characteris-
tics of this application on the Hypercluster and Intel
iPSC/860. A viscous version of the code, MSTAGE,
Figure 5: ISTAGE speedup (plot of speedup versus number of processors, both axes 0 to 32)
is now being parallelized for the Hypercluster in an-
ticipation of its use with the Integrated CFD and Ex-
periment (ICE) Project [13].
A trivial fractal application was written to demon-
strate the interactive capabilities of Hypercluster
client programs. The user specifies on the screen what
area of the Mandelbrot set is to be evaluated and the
client sends messages to Hypercluster processors to
perform the evaluation. The client performs load bal-
ancing by giving each processor one row of the display
to calculate; when a processor replies with its results,
it is given the parameters of the next unevaluated row.
This "embarrassingly" parallel application also demonstrates
the peak Hypercluster computational capability
of 240 MFLOPS when using all 40 processors.
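
The row-at-a-time load balancing can be sketched as a serial simulation. This is not the client's code: worker replies are simulated in a fixed order rather than arriving over the network, and the function names and iteration limit are assumptions.

```python
from collections import deque


def mandelbrot_row(y: float, xs: list[float], max_iter: int = 50) -> list[int]:
    """Iteration counts for one display row (the work done by one processor)."""
    row = []
    for x in xs:
        z, c, n = 0j, complex(x, y), 0
        while abs(z) <= 2.0 and n < max_iter:
            z = z * z + c
            n += 1
        row.append(n)
    return row


def farm_rows(ys: list[float], xs: list[float], workers: int):
    """Hand out one row per worker; each reply triggers the next unevaluated row."""
    pending = deque(enumerate(ys))       # rows not yet assigned
    image = [None] * len(ys)
    rows_done = [0] * workers
    busy = deque()                       # (worker, row index, y) currently assigned
    for w in range(workers):             # prime every worker with one row
        if pending:
            busy.append((w, *pending.popleft()))
    while busy:
        w, i, y = busy.popleft()         # simulated reply from worker w
        image[i] = mandelbrot_row(y, xs)
        rows_done[w] += 1
        if pending:                      # give the now idle worker the next row
            busy.append((w, *pending.popleft()))
    return image, rows_done
```

On the real system a processor that draws an expensive row simply replies later and therefore receives fewer rows, which is what balances the load without any advance knowledge of the cost per row.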
For a more representative view of the Hyper-
cluster's performance, the nCUBE version of the
SLALOM benchmark was ported and run [14]. Only
16 of the available 32 processors were used since this
parallel version of SLALOM assumes a square hyper-
cube topology. The system was able to evaluate 933
patches in the required 60 seconds, corresponding to
a rate of 14.7 MFLOPS. This level of performance
was achieved without any special coding tricks or optimizations.
Tuning to the 88100 architecture should
improve this figure.
5.2 System software
The Hypercluster played a significant role in the
development and debugging of the APPL. The ini-
tial target machines for the APPL project were the
Intel iPSC/860, the Alliant FX/80, and the Hypercluster.
These represent three different classes of architectures:
a distributed memory machine, a shared
memory machine, and a machine which has a combination
of shared and distributed memory. Although
the Hypercluster and the Intel iPSC/860 were similar,
the differences in the two architectures required
developing a programming model which allowed an
application to run on either machine without modification.
This resulted in the APPL process definition
concept, which makes an APPL program "architecture
independent".
In the area of performance visualization, a special
client sub-process has been written to gather and dis-
play statistics for the system while it is running an
application. Statistics such as CPU usage, message
communication rates, and link communication rates
for the entire system are shown as bar graphs in the
main panel. A more detailed view of a given proces-
sor or link can be called up by clicking on the corre-
sponding chart in the main panel with the mouse. The
detailed displays show trends over the last thirty sec-
onds along with numerical totals. Figure 6 shows the
display during a run of the SLALOM benchmark de-
scribed above. Performance problems such as load im-
balances or link congestion can be seen with a glance
at the main panel. This capability is available to any
client running on a Silicon Graphics workstation by
linking with the standard client support library.
As a first step towards system support capable
of "steering" a computation, a sample application-specific
client has been written to interactively view
and manipulate a calculation as it progresses (see Fig-
ure 7). This client uses the user's workstation for
graphical visualization of results as they are com-
puted. Results may also be recorded for later play-
back. Coefficients used in the algorithm may be al-
tered in real-time and the effect on the solution pro-
cess is displayed. Currently only the coefficients are
directly manipulated by the client; the application
running in the Hypercluster processors is responsible
for distributing the initial temperatures and assem-
bling the (distributed) solution data into BSP mem-
ory, which the client then accesses for display. With
additional work in the area of describing how application
data is distributed among processors, it would
be possible for the client to distribute the initial tem-
perature data and assemble the solution data directly
from the Hypercluster processors for display. This ap-
proach may be investigated as part of future system
software research activities.
6 Concluding remarks
The above description of the second generation Hy-
percluster shows it to be a parallel processor based
upon hypercube topology, with each node a shared
memory multiprocessor in its own right. The new
hardware uses contemporary RISC processors with a
front end computer running UNIX.
The dedicated test-bed environment and custom-
built software make the Hypercluster ideal for parallel-
processing research in both the application and system
software arenas. The Hypercluster has been used to
conduct software experiments that would have been
difficult or even impossible to do on a shared, com-
mercial system.
Finally, the system continues to evolve. Possible
future work includes investigations into better sup-
port for application debugging, performance visual-
ization, and the interactive steering of a computation
in progress.
References
[1] Blech, R. A. and Arpasi, D. J.: "Hardware for
a Real-Time Multiprocessor Simulator." NASA
TM 83805, 1985.
[2] Blech, R. A. and Williams, A. D.: "Hardware
Configuration for a Real-Time Multiprocessor
Simulator." NASA TM 88802, 1986.
[3] Arpasi, D. J.: "RTMPL—A Structured Program-
ming and Documentation Utility for Real-Time
Multiprocessor Simulations." NASA TM 83606,
1984.
[4] Cole, G. L.: "Operating System for a Real-
Time Multiprocessor Propulsion System Simula-
tor." NASA TM 83605, 1984.
[5] Arpasi, D. J. and Milner, E. J.: "Partitioning
and Packing Mathematical Simulation Models for
Calculation on Parallel Computers." NASA TM
87170, 1986.
[6] Milner, E. J. and Arpasi, D. J.: "Simulating a
Small Turboshaft Engine in a Real-Time Multiprocessor
Simulator (RTMPS) Environment."
NASA TM 87216, 1986.
[7] Blech, R. A.: "The Hypercluster: A Parallel
Processing Test-Bed Architecture for Computational
Mechanics Applications." NASA TM-89823,
1987.
[8] Quealy, A.: "Hypercluster Parallel Processing Li-
brary User's Manual." NASA CR 185231, 1990.
[9] Blech, R. A.; Quealy, A.; and Cole, G. L.:
"A Message-Passing Kernel for the Hypercluster
Parallel-Processing Test Bed." NASA TM-101952,
1989.
[10] Cole, G. L.; Blech, R. A.; and Quealy, A.:
"Initial Operating Capability for the Hypercluster
Parallel-Processing Test Bed." NASA TM-101953,
1989.
[11] Quealy, A.: "Portable Programming on Paral-
lel/Networked Computers Using the Application
Portable Parallel Library (APPL)." to be pub-
lished.
[12] Blech, R. A.; Milner, E. J.; Quealy, A.; and
Townsend, S. E.: "Turbomachinery CFD on Parallel
Computers." NASA TM-105932.
[13] Szuch, J. R.; and Arpasi, D. J.: "Enhancing
Aeropropulsion Research with High-Speed Inter-
active Computing." NASA TM 104374, 1991.
[14] Gustafson, J., et al.: "The Design of a Scalable,
Fixed-Time Computer Benchmark." J. of Parallel
and Distributed Computing 12, 388-401, 1991.
Figure 6: Runtime statistics during SLALOM benchmark (screen capture: per-node CP, QP, and link activity charts, detailed node and link statistics panels, and the benchmark's per-task timing output)
Figure 7: Visualization and steering of a computation (screen capture: temperature and coefficient controls, contour, surface, and grid plot options, and 2D/3D display options)
REPORT DOCUMENTATION PAGE (Form Approved, OMB No. 0704-0188)
1. AGENCY USE ONLY (Leave blank)
2. REPORT DATE: April 1993
3. REPORT TYPE AND DATES COVERED: Technical Memorandum
4. TITLE AND SUBTITLE: A Multiarchitecture Parallel-Processing Development Environment
5. FUNDING NUMBERS: WU-505-62-52
6. AUTHOR(S): Scott Townsend, Richard Blech, and Gary Cole
7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES): National Aeronautics and Space Administration, Lewis Research Center, Cleveland, Ohio 44135-3191
8. PERFORMING ORGANIZATION REPORT NUMBER: E-7744
9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES): National Aeronautics and Space Administration, Washington, D.C. 20546-0001
10. SPONSORING/MONITORING AGENCY REPORT NUMBER: NASA TM-106180
11. SUPPLEMENTARY NOTES
Prepared for the Seventh International Parallel Processing Symposium sponsored by the Institute of Electrical and Electronics Engineers Computer Society,
Newport Beach, California, April 13-16, 1993. Scott Townsend, Sverdrup Technology, Inc., Lewis Research Center Group, 2001 Aerospace Parkway, Brook
Park, Ohio 44142 and Richard Blech and Gary Cole, NASA Lewis Research Center, Cleveland, Ohio. Responsible person, Scott Townsend, (216) 433-8101.
12a. DISTRIBUTION/AVAILABILITY STATEMENT: Unclassified - Unlimited, Subject Category 62
12b. DISTRIBUTION CODE
13. ABSTRACT (Maximum 200 words)
A description is given of the hardware and software of a multiprocessor test bed—the second generation Hypercluster
system. The Hypercluster architecture consists of a standard hypercube distributed-memory topology, with multipro-
cessor shared-memory nodes. By using standard, off-the-shelf hardware, the system can be upgraded to use rapidly
improving computer technology. The Hypercluster's multiarchitecture nature makes it suitable for researching parallel
algorithms in computational field simulation applications (e.g., computational fluid dynamics). The dedicated test-bed
environment of the Hypercluster and its custom-built software allows experiments with various parallel-processing
concepts such as message passing algorithms, debugging tools, and computational "steering". Such research would be
difficult, if not impossible, to achieve on shared, commercial systems. Keywords—parallel processing, system
software, computer architecture, computational fluid mechanics.
14. SUBJECT TERMS: Parallel processing; System software; Computer architecture; Computational fluid mechanics
15. NUMBER OF PAGES: 7
16. PRICE CODE: A02
17. SECURITY CLASSIFICATION OF REPORT: Unclassified
18. SECURITY CLASSIFICATION OF THIS PAGE: Unclassified
19. SECURITY CLASSIFICATION OF ABSTRACT: Unclassified
20. LIMITATION OF ABSTRACT
NSN 7540-01-280-5500
Standard Form 298 (Rev. 2-89)
Prescribed by ANSI Std. Z39-18
298-102
