The finite element machine:  An experiment in parallel processing by Crockett, T. W. et al.
NASA-TM-84514 19820024127
NASA Technical Memorandum 84514
THE FINITE ELEMENT MACHINE: AN EXPERIMENT IN
PARALLEL PROCESSING
O. O. Storaasli, S. W. Peebles, T. W. Crockett, J. D. Knott, and L. Adams













THE FINITE ELEMENT MACHINE: An Experiment in Parallel Processing
O. O. Storaasli and S. W. Peebles
NASA Langley Research Center
Hampton, Virginia







The Finite Element Machine at the NASA Langley Research Center is a pro-
totype computer designed to support parallel solutions to structural analysis
problems. This paper describes the hardware architecture and support software
for the machine, initial solution algorithms and test applications, prelimi-
nary results, and directions for future work.
INTRODUCTION
A large class of structural analysis problems is solved by computer using
finite element and finite difference approximation techniques. Although these
problems have traditionally been solved on conventional sequential computers,
an analysis of these methods shows that they contain many calculations which
could be performed simultaneously, thereby reducing the time required for a
solution (ref. 1). To support this concurrency, special computers are needed
which can perform many operations in parallel. One option is to construct
vector computers which operate on large arrays of data, but this approach is
only effective when the data can be structured appropriately. A different
approach is to construct a machine which consists of a large number of
general-purpose processing elements coupled together in a parallel architecture.
Advances in microcomputer technology during the last decade have
reduced the size and cost of computing elements, making construction of this
type of parallel processor increasingly practical. Such processors are being
actively investigated for their potential uses. Examples include CM* and
C.mmp at Carnegie-Mellon University (ref. 2), ZMOB at the University of
Maryland (ref. 3), and the New York University Ultracomputer (ref. 4).
Work is currently underway at NASA Langley Research Center to investigate
solutions of structural analysis problems using parallel microprocessor
systems. Research topics include hardware configurations, software design,
problem partitioning, and numerical algorithms. As part of this effort, a
prototype parallel processor designated the Finite Element Machine (FEM) is
being built and evaluated. This paper describes the Finite Element Machine,
its support software, current applications and algorithms, and preliminary
results.
FEM DESCRIPTION
To support parallel processing, an appropriate combination of hardware
features and system software is needed. This section first outlines the FEM
hardware organization, and then describes two packages of system software de-
veloped to provide control and run-time support.
Hardware Architecture
The architecture of the Finite Element Machine was specifically designed
to support a parallel decomposition of structural problems by assigning nodes
in the structural model to processors in the machine (refs. 1, 5, and 6).
This approach is illustrated in figure 1 with an idealized wing model. The
calculations to be performed at nodes in the model are mapped onto the array
of microprocessors. The lines drawn between microprocessors indicate depen-
dencies between nodes which lie on the same finite element, but have been
mapped into different processors. Because of these dependencies, data trans-
fer is required between these processors. The number of structural nodes is
not limited to the number of processors, since multiple nodes may be assigned
to a single processor (see, for example, ref. 7). The mapping of nodes onto
processors is discussed in more detail in later sections.
The Finite Element Machine is a multiple-instruction multiple-data (MIMD)
parallel processor consisting of an asynchronous array of interconnected
microcomputers (the Array) linked to a minicomputer front end (the Control-
ler). A block diagram of the architecture is shown in figure 2. Unlike many
multiprocessor designs which use large shared memories, the FEM architecture
provides each processor with its own local memory, and no sharing is
possible. Instead, special communication hardware (described below) allows
the processors to communicate with each other. The current prototype machine
(figure 3) is being built in stages of 4, 16, and 36 processors. In principle,
however, the architecture could be expanded to accommodate large numbers
of processors (perhaps hundreds or thousands). At this writing, a four-
processor Array is operational and the hardware for the 16- and 36-processor
stages is nearly complete.
All processors in the Array are identical and consist of a 16-bit micro-
processor, an attached floating-point unit, 32K bytes of random access memory
(RAM), and 16K bytes of read-only memory (ROM). Serial I/O ports called
"local links" provide data communication paths between a processor and up to
twelve of its neighbors. The local links are reconfigurable and can support a
variety of interconnection topologies. A time-multiplexed parallel "global
bus" connects all processors to each other and to the controller, and provides
a general-purpose secondary communications path. A network of binary flags
spans the Array and is used for processor synchronization and other signaling
needs. A distributed "sum/maximum" network computes the sum and maximum of
the inputs from all of the processors. This can be used for global
calculations (ref. 8), cooperative sorting, and processor sequencing. For
more details on the Array hardware, see reference 9.
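As an illustration of how the sum/maximum network might serve a global calculation, the following Python sketch emulates it in software: each processor contributes one local value, and every processor sees the global sum and maximum. The function names `sum_max_network` and `globally_converged`, and the convergence-test usage, are hypothetical; the real network is hardware, and its programming interface is not shown here.

```python
from typing import List, Tuple

def sum_max_network(local_values: List[float]) -> Tuple[float, float]:
    """Emulate the sum/maximum network: combine one input per processor
    and return the (sum, maximum) that every processor would receive."""
    if not local_values:
        raise ValueError("at least one processor must contribute a value")
    return sum(local_values), max(local_values)

def globally_converged(local_residuals: List[float], tol: float) -> bool:
    """Hypothetical use: stop an iterative solver only when the largest
    local residual on any processor is below the tolerance."""
    _, worst = sum_max_network(local_residuals)
    return worst < tol

# Four processors report their local residual norms after one iteration.
residuals = [1e-7, 3e-6, 2e-8, 5e-7]
total, worst = sum_max_network(residuals)
print(worst)                                    # 3e-06
print(globally_converged(residuals, tol=1e-5))  # True
```

A single maximum over all processors is what makes a global stopping test cheap here: no processor needs to collect the individual residuals.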
The Controller is a small minicomputer which initiates and monitors
activity on the Array and provides mass storage for programs and data on
attached peripherals. It also hosts the user interface to the system,
including interactive graphics.
Controller Support Software
The Controller provides the user with program development tools and the
ability to define problems, activate and monitor the Array, and obtain and
process results. The Controller runs a general-purpose disk-based operating
system accessed by a menu-driven command interpreter. Commands on the
Controller are implemented as control language procedures. All FEM commands
are constructed in this manner. This approach presents the user with a
consistent interface which is a natural extension of the Controller operating
system. The system software provided on the Controller can be divided into
four functional areas of support: program development, problem description,
program execution on the Array, and post-processing or analysis.
Program development on the Controller is supported primarily by the
vendor's standard software. A screen editor, an assembler, a reverse
assembler, a Pascal compiler, and a link editor are available. Parallel
application programs for the Array are written in Pascal; support for the FEM
architecture is provided by a library of special routines. Users ordinarily
select a program from a package of solution algorithms available for general
use. Should a user prefer to write his own solution code or require special
post-processing of data, he has access to all the necessary tools on the
Controller.
Before executing the parallel solution program, the user must model his
problem and provide data in accordance with the protocols of the intended
solution algorithm. An interactive graphics interface allows the user to
model structures and generate nodal coordinates, element connectivity,
material properties, and constraints. As an alternative, the user may choose
to create or modify data files using the text editor or his own utility
program.
Normally, the program execution environment is established by entering a
single Controller command. One command can be structured to call all
necessary sub-commands via the command interpreter. Typically, the execution
session involves Array initialization, selection of the Array configuration
desired, downloading the selected algorithm and any necessary data, and
entering an interactive execute mode. During program execution on the Array,
all messages from a preselected reference processor and errors from all
processors are displayed on the user's CRT. Processors can send messages,
make interactive queries, and report error conditions at any time during the
execute sequence.
Three files are maintained for the user during execution on FEM. A
"FEMDATA" file records all data transferred to the Controller, formatted in
accordance with its type, and identified by source processor number. A
"FEMLOG" file is used to record events over the course of an entire FEM
session. This file is initialized when the Array is reset, and thereafter
records each command invocation along with its associated data. For example,
the command to download a program writes entry and exit messages, the name of
the file downloaded, status information, and the load and entry addresses of
the object code for each of the affected processors. The FEMLOG also records
the source and error number for all errors as they occur. This process
provides the user with a session record for later reference or analysis. A
"FEMERROR" file is initialized at the beginning of each command. This file is
used to record errors detected within the scope of a single command. It
contains the error number, severity, source, processor status information, and
an expanded error message for each error detected. If an error is detected
during execution of any command, the FEMERROR file is displayed upon exit.
Additional user support is provided in the form of interactive debug
commands. Debugging commands allow the user to dump memory and set breakpoints,
to single step, halt, kill, and resume tasks, and to inspect and change
status, registers, and memory. In addition, the Array keeps execution
statistics and can be directed to trace execution and check in with the
Controller at regular intervals to maintain confidence during long
computations.
Software support on the Controller for post-processing and analysis of
data consists of a set of utility programs. Commands are provided to sort the
data file to provide a listing of communications by processor, analyze the
trace information to determine where each processor spent its time during
execution, and upload and format the results of computations on the Array. In
addition, information such as nodal displacements can be displayed in graphic
form (see figure 4).
System Software for the Array
System software for the array of microprocessors consists of an operating
system, a subroutine library, and a set of diagnostics. The relationships of
these components to each other and to the applications software are shown in
figure 5. Although diagnostics are vitally important for the validation and
maintenance of a computer system, they are beyond the scope of this paper.
A complete copy of the operating system, called Nodal Exec, is stored in
ROM on each of the microcomputers in the Array. Nodal Exec is divided into
two major sections, a nucleus and a package of command routines. The nucleus
(the innermost portion of the operating system) provides such functions as
interrupt handling, basic I/O, timing, memory allocation, task management, and
a command monitor.
Command routines are used to implement all functions which the Controller
may direct the microcomputers to perform. Such functions include downloading
object code and data, establishing processor connectivity, executing programs,
and uploading results. Several debug commands are also available to read and
modify memory locations, inspect registers, set breakpoints, and step through
instruction execution. The philosophy of Nodal Exec is to provide sufficient
functionality with a set of relatively simple commands so that the Controller
software can combine them into a sophisticated user interface.
A library of Pascal-callable subroutines, PASLIB, provides support
facilities for application programs. The PASLIB routines are essentially an
extension to Nodal Exec, serving as high-level supervisor calls or, in some
cases, interfacing directly to hardware functions. They allow user programs to
communicate with other processors and with the Controller, to use the flag and
sum/max networks, to access data areas, and to perform arithmetic using the
floating-point processor. Frequently used mathematical subroutines (e.g.,
vector dot product) are also available which use the stack architecture of the
floating-point unit to optimize performance. The most commonly used PASLIB
routines are stored in ROM on the processors. The remaining routines reside
in a library file on the Controller where they can be linked to user programs
and downloaded with the object code.
Nodal Exec and PASLIB support three major concepts which are important in
understanding the flow of data on FEM: data areas, connectivity, and
interprocessor communication. Explanations of each of these concepts are
presented in the subsequent paragraphs.
Data areas are the primary mechanism for transferring data between the
Controller and the application programs running on the processors in the
Array. Data areas required by an application are allocated in each
processor's memory prior to program execution. They contain space for a
specified number of data items of a particular type. Allowable data types are
integer, long integer, real, double precision, or user-defined records. Once
allocated, a data area can be filled with data from the Controller,
initialized to some value, or left empty. Application programs reference data
areas via pointer variables. Data areas can provide input to the program,
receive output, or both. Since data areas exist independently of the programs
which access them, they can be used to pass information between separate
programs which execute in a series. When a program (or series of programs) is
finished, results stored in data areas are uploaded to the Controller for file
storage or post-processing.
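The data-area life cycle described above (allocate with a type and item count; fill from the Controller, initialize, or leave empty; share across programs run in series; upload at the end) can be modeled in a few lines. The following Python sketch is only an analogy: the `DataArea` class, its methods, the type-name strings, and the two example programs are hypothetical and are not the PASLIB or Nodal Exec interface.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical names for the allowable data-area element types.
ALLOWED_TYPES = {"integer", "long integer", "real", "double precision", "record"}

@dataclass
class DataArea:
    name: str
    item_type: str
    count: int
    items: List[Optional[object]] = field(init=False)

    def __post_init__(self) -> None:
        if self.item_type not in ALLOWED_TYPES:
            raise ValueError(f"unsupported data type: {self.item_type}")
        self.items = [None] * self.count  # allocated but left empty

    def fill(self, values: List[object]) -> None:
        """Load data sent from the Controller."""
        if len(values) != self.count:
            raise ValueError("value count does not match allocation")
        self.items = list(values)

    def initialize(self, value: object) -> None:
        self.items = [value] * self.count

    def upload(self) -> List[Optional[object]]:
        """Copy of the contents, as returned to the Controller."""
        return list(self.items)

def program_one(areas: Dict[str, DataArea]) -> None:
    areas["u"].initialize(0.0)

def program_two(areas: Dict[str, DataArea]) -> None:
    areas["u"].items[1] = 2.5  # works on what program_one left behind

# The table of areas outlives each program, so programs run in series
# can pass results to one another before the final upload.
memory: Dict[str, DataArea] = {"u": DataArea("u", "real", 3)}
program_one(memory)
program_two(memory)
print(memory["u"].upload())  # [0.0, 2.5, 0.0]
```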
Connectivity is the concept of establishing communication paths between
programs executing on different processors. Connectivity may be viewed at two
levels, the logical problem level and the physical processor level. Logical
connectivity refers to the interconnections between nodes in a structure by
virtue of the fact that the nodes lie on the same finite element. Physical
connectivity refers to the physical I/O interconnections between processors.
For simple or regular structures, mapping the logical interconnection pattern
onto the planar mesh of physical processors may be straightforward. In
general, however, the mapping problem is difficult and the local neighbor
connections are insufficient; therefore, the global bus must be used. Bokhari
has addressed this problem and developed an algorithm which attempts to
maximize the use of the local links (ref. 10). This algorithm is implemented
on the Controller by an auxiliary program whose output is the
logical-to-physical mapping of node numbers to processor numbers.
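To make the objective of the mapping problem concrete, the sketch below scores a given logical-to-physical assignment by counting how many logical node-to-node connections land on processors joined by a local link; the rest would have to use the global bus. It assumes the eight-nearest-neighbor planar mesh with toroidal wrap-around mentioned under FUTURE DIRECTIONS and row-by-row processor numbering. This is only a scoring step, not Bokhari's mapping algorithm (ref. 10), and the function names are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

def mesh8_neighbors(p: int, rows: int, cols: int) -> Set[int]:
    """Processors one local link away from p in an eight-nearest-neighbor
    mesh with toroidal wrap-around; processors are numbered row by row."""
    r, c = divmod(p, cols)
    nbrs = {((r + dr) % rows) * cols + (c + dc) % cols
            for dr in (-1, 0, 1) for dc in (-1, 0, 1) if dr or dc}
    nbrs.discard(p)  # only matters for very small arrays
    return nbrs

def score_mapping(edges: Iterable[Tuple[int, int]],
                  node_to_proc: Dict[int, int],
                  rows: int, cols: int) -> Tuple[int, int]:
    """Count logical connections carried by local links vs. the global bus."""
    local = bus = 0
    for a, b in edges:
        pa, pb = node_to_proc[a], node_to_proc[b]
        if pa == pb:
            continue  # same processor: no interprocessor transfer needed
        if pb in mesh8_neighbors(pa, rows, cols):
            local += 1
        else:
            bus += 1
    return local, bus

edges = [(0, 1), (1, 5), (0, 10)]            # logical node connections
identity = {n: n for n in (0, 1, 5, 10)}     # node n on processor n
print(score_mapping(edges, identity, 4, 4))  # (2, 1)
remapped = {0: 0, 1: 1, 5: 5, 10: 15}        # wrap-around makes 0-15 local
print(score_mapping(edges, remapped, 4, 4))  # (3, 0)
```

A mapping algorithm would search over assignments to raise the first count; the second count is the traffic left for the global bus.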
Interprocessor communication takes place via the local and global I/O
paths which were enabled during the connectivity process. Communication is
based on the transmission of records which contain from one to 255 data
words. An associated tag word in the record header is used to distinguish the
information content when multiple records are sent to the same processor.
Interprocessor communication is handled either synchronously or asynchronously
by the system software. If synchronous mode is used, input from a neighboring
processor is queued in the order in which it is received, and it must all be
read and processed by the receiver. In the asynchronous case, only the most
recently received record (for each different tag) is saved. An algorithm
which uses this asynchronous or "chaotic" communication technique is discussed
in the next section.
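The difference between the two communication modes can be sketched as two receive buffers. In the synchronous model every record from a neighbor is queued in arrival order and must be consumed; in the asynchronous model a newer record with the same tag overwrites the older one. The record layout (a tag plus 1 to 255 data words) follows the text; the class names `Record`, `SyncInbox`, and `AsyncInbox` are hypothetical and are not the Nodal Exec interface.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, Optional

MAX_WORDS = 255

@dataclass(frozen=True)
class Record:
    tag: int
    words: tuple  # 1 to 255 data words

    def __post_init__(self) -> None:
        if not 1 <= len(self.words) <= MAX_WORDS:
            raise ValueError("a record carries 1 to 255 data words")

class SyncInbox:
    """Synchronous mode: FIFO queue; every record must be read."""
    def __init__(self) -> None:
        self._queue: Deque[Record] = deque()

    def deliver(self, rec: Record) -> None:
        self._queue.append(rec)

    def receive(self) -> Optional[Record]:
        return self._queue.popleft() if self._queue else None

class AsyncInbox:
    """Asynchronous ("chaotic") mode: keep only the latest record per tag."""
    def __init__(self) -> None:
        self._latest: Dict[int, Record] = {}

    def deliver(self, rec: Record) -> None:
        self._latest[rec.tag] = rec

    def receive(self, tag: int) -> Optional[Record]:
        return self._latest.get(tag)

sync, chaotic = SyncInbox(), AsyncInbox()
for k in range(3):  # a neighbor sends three successive updates with tag 7
    rec = Record(tag=7, words=(float(k),))
    sync.deliver(rec)
    chaotic.deliver(rec)
print([sync.receive().words for _ in range(3)])  # [(0.0,), (1.0,), (2.0,)]
print(chaotic.receive(7).words)                  # (2.0,)
```

The synchronous buffer preserves every intermediate value, while the asynchronous one trades that history for never blocking on stale data.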
CURRENT ALGORITHMS AND APPLICATIONS
To solve structural problems in parallel requires the development of
algorithms to support parallel computations and a scheme to partition the
structural model for distribution among the processors. The following section
discusses the assignment of problems to the Array, and gives results from
several applications run on the four-processor version of FEM.
Problem Partitioning
Factors that influence the design of an appropriate algorithm for solving
problems on FEM include the structural region discretization, the number of
processors available, and the amount of communication required between proces-
sors. The following example illustrates these considerations.
Figure 6a shows a cantilevered rectangular plate in plane stress
constrained on one edge and loaded on the opposite edge. If the plate is
discretized by linear triangular finite elements, a structural node is common
to at most six elements, and is connected to at most six other nodes (see
figure 6b). This is significant because it implies that in the system of
equations (1) for the vector of displacements, u, the stiffness matrix, K, is a
sparse matrix containing at most 14 nonzero entries in each row:
$$K u = f \qquad (1)$$
Twelve of these entries represent contributions of the six neighboring nodes
(two per node) to the solution at a given node while two additional entries
are contributions from the given node itself.
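The bound of 14 nonzeros per row can be checked numerically. The sketch below triangulates a rectangular grid of nodes with linear triangles (every grid square split along the same diagonal, an assumed mesh pattern), collects the node-to-node connections, and counts the nonzeros in each row of K with two degrees of freedom per node. The function names are illustrative only.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

def triangulate(nx: int, ny: int) -> List[Tuple[int, int, int]]:
    """Split each cell of an nx-by-ny node grid into two linear triangles,
    always along the same diagonal (an assumed mesh pattern)."""
    tris = []
    for j in range(ny - 1):
        for i in range(nx - 1):
            n0 = j * nx + i                  # lower-left node of the cell
            n1, n2, n3 = n0 + 1, n0 + nx, n0 + nx + 1
            tris.append((n0, n1, n3))
            tris.append((n0, n3, n2))
    return tris

def node_neighbors(tris: List[Tuple[int, int, int]]) -> Dict[int, Set[int]]:
    """Nodes sharing at least one element with each node."""
    nbrs: Dict[int, Set[int]] = defaultdict(set)
    for tri in tris:
        for a in tri:
            for b in tri:
                if a != b:
                    nbrs[a].add(b)
    return nbrs

def nonzeros_per_row(nbrs: Dict[int, Set[int]],
                     dof_per_node: int = 2) -> Dict[int, int]:
    """A row of K couples one dof to the node's own dofs and its neighbors'."""
    return {n: dof_per_node * (len(s) + 1) for n, s in nbrs.items()}

nbrs = node_neighbors(triangulate(5, 4))
print(max(len(s) for s in nbrs.values()))    # 6 neighboring nodes
print(max(nonzeros_per_row(nbrs).values()))  # 14 nonzeros per row
```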
The sparsity of the stiffness matrix, K, suggests that an iterative
algorithm could be used to solve equation (1) by assigning one processor to
calculate displacements at each node in the plate. For maximum efficiency, an
algorithm should be developed such that each processor would only need to
communicate information to its six neighbors via the dedicated local links, as
shown in figure 6c. This scheme is feasible only if the number of processors
is not less than the number of structural nodes and if a suitable iterative
algorithm can be found to take advantage of the connectivity shown in figure
6c. Furthermore, such an algorithm is efficient only if the overhead due to
communication between processors is not prohibitive.
In most instances, the number of structural nodes exceeds the number of
available processors. For these cases, it is necessary to assign multiple
nodes to a processor and to develop algorithms to solve for the displacements
at these nodes. Figures 6d and 6e show, respectively, how nodes of the plate
can be assigned to a 4- or 16-processor Array. The local links that are used
by the processors in figures 6d and 6e are illustrated in figures 6f and 6g.
Even though FEM was designed with finite element discretizations in mind,
the architecture also supports the solution of problems that are discretized
by finite difference techniques. Two such problems and their associated
discretizations are given in figure 7. The five-point star discretization of
the membrane equation is shown in figure 7a and the discretization for the
plate equation is given in figure 7b. For a one node per processor assignment,
the iterative solution of the membrane equations using the discretization of
figure 7a requires four of the local links of each processor, while the
iterative solution of the plate equation using the discretization of figure 7b
requires
all twelve of the local links. In the case of multiple nodes per processor,
the solution algorithm determines the proper assignment of nodes to proces-
sors. In the following, algorithms that have been run on FEM for both finite
element and finite difference discretizations are discussed.
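The link counts quoted above follow from the stencils. The sketch below lists the grid offsets of the standard five-point Laplacian star (membrane, $\nabla^2 u = 0$) and the standard thirteen-point biharmonic stencil (plate, $\nabla^4 u = 0$), and counts the distinct neighbors each interior point needs when there is one node per processor. The weights are the textbook ones for a uniform grid with unit spacing; they are an assumption, since the paper does not list them.

```python
from typing import Dict, Tuple

Offset = Tuple[int, int]

# Five-point star for the Laplacian (membrane problem), unit grid spacing.
LAPLACE_5: Dict[Offset, int] = {
    (0, 0): -4, (1, 0): 1, (-1, 0): 1, (0, 1): 1, (0, -1): 1,
}

# Thirteen-point stencil for the biharmonic operator (plate problem).
BIHARMONIC_13: Dict[Offset, int] = {(0, 0): 20}
for d in ((1, 0), (-1, 0), (0, 1), (0, -1)):
    BIHARMONIC_13[d] = -8                    # nearest neighbors
    BIHARMONIC_13[(2 * d[0], 2 * d[1])] = 1  # two points away
for d in ((1, 1), (1, -1), (-1, 1), (-1, -1)):
    BIHARMONIC_13[d] = 2                     # diagonal neighbors

def links_needed(stencil: Dict[Offset, int]) -> int:
    """With one grid point per processor, every off-center stencil point
    lives on a different processor and needs its own link."""
    return sum(1 for off in stencil if off != (0, 0))

print(links_needed(LAPLACE_5))      # 4
print(links_needed(BIHARMONIC_13))  # 12
```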
FEM Applications
Solution algorithms for the first applications run on FEM used three
standard iterative methods: Jacobi, conjugate gradient, and successive over-
relaxation (SOR). These methods contain suitable parallelism and were used to
solve sparse symmetric positive definite systems of linear structural equa-
tions resulting from finite element or finite difference discretizations.
Smith and Loendorf (ref. 11) solved a cantilevered wing box finite
element model using the basic Jacobi iterative method. This small problem
provided a useful benchmark for assessing a number of performance issues.
Their results for one, two, and four processors show that the increased
overhead for interprocessor communication largely offset the improvements
gained by distributing the computation, thereby resulting in only modest
reductions in the solution time. Their analysis suggests that there is a
break-even point beyond which additional partitioning of a problem is
ineffective. The results from this problem have also prompted a re-thinking
of processor communication strategies and several modifications have been
proposed to reduce overhead.
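The basic Jacobi method parallelizes naturally because every new component depends only on the previous iterate. The sketch below partitions the unknowns into contiguous blocks, one per simulated processor; each block is updated from the old vector, and the full new iterate is then shared before the next sweep. The block partitioning, the tolerance, and the small test system are assumptions chosen for illustration, not the wing box model of ref. 11.

```python
from typing import List

def jacobi_partitioned(A: List[List[float]], b: List[float], nprocs: int,
                       tol: float = 1e-10,
                       max_iter: int = 10_000) -> List[float]:
    n = len(b)
    # Contiguous block of row indices owned by each simulated processor.
    bounds = [(p * n // nprocs, (p + 1) * n // nprocs) for p in range(nprocs)]
    x = [0.0] * n
    for _ in range(max_iter):
        new_x = x[:]
        for lo, hi in bounds:  # independent blocks: could run in parallel
            for i in range(lo, hi):
                s = sum(A[i][j] * x[j] for j in range(n) if j != i)
                new_x[i] = (b[i] - s) / A[i][i]
        # "Communication" step: every processor now sees the new iterate.
        change = max(abs(u - v) for u, v in zip(new_x, x))
        x = new_x
        if change < tol:
            return x
    raise RuntimeError("Jacobi iteration did not converge")

n = 8
A = [[4.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
b = [sum(row) for row in A]  # chosen so the exact solution is all ones
x = jacobi_partitioned(A, b, nprocs=4)
print(max(abs(v - 1.0) for v in x) < 1e-8)  # True
```

In a real distributed run, only the boundary values of each block would need to be sent to neighboring processors; the full copy here just keeps the simulation short.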
The same problem was also solved using an asynchronous Jacobi iterative
method (see Baudet, ref. 12) in which each processor performs its calculations
independently with no synchronization among processors. Intermediate results
were passed between processors using the asynchronous communication mode
discussed previously. The asynchronous Jacobi algorithm was run using two and
four processors, and the results were inconclusive. In both cases, the
asynchronous method required less time per iteration than the standard Jacobi
technique, and the program converged to results which were similar to those of
the standard Jacobi. However, the number of iterations required for conver-
gence differed, with the two-processor case using about the same number as the
standard Jacobi, and the four-processor case using more. The result was that
for two processors, the asynchronous method slightly outperformed the standard
Jacobi, but with four processors, the reverse was true. Further experimenta-
tion with modified communication procedures and other application problems is
needed to better assess the asynchronous approach.
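An asynchronous (chaotic) variant can be simulated by letting each simulated processor update its own unknowns at its own pace, using whatever values it last received from the others. The sketch below models this with a seeded random schedule: at each step one processor updates its block from a possibly stale snapshot of the other processors' values. The schedule, the staleness model (`max_delay`), and the test system are assumptions; the timing and communication behavior measured on FEM are not reproduced. Convergence here relies on the test matrix being strictly diagonally dominant, a standard sufficient condition for chaotic relaxation (see Baudet, ref. 12, for the general theory).

```python
import random
from typing import List

def async_jacobi(A: List[List[float]], b: List[float], nprocs: int,
                 steps: int = 4000, max_delay: int = 3,
                 seed: int = 1) -> List[float]:
    n = len(b)
    bounds = [(p * n // nprocs, (p + 1) * n // nprocs) for p in range(nprocs)]
    rng = random.Random(seed)
    x = [0.0] * n
    history = [x[:]]  # recent global states, newest last
    for _ in range(steps):
        lo, hi = bounds[rng.randrange(nprocs)]  # next processor to finish
        # Other processors' values may be up to max_delay - 1 states old.
        stale = history[-1 - rng.randrange(len(history))]
        own = x[lo:hi]  # a processor always knows its own current values
        new_vals = []
        for i in range(lo, hi):
            s = sum(A[i][j] * (own[j - lo] if lo <= j < hi else stale[j])
                    for j in range(n) if j != i)
            new_vals.append((b[i] - s) / A[i][i])
        x[lo:hi] = new_vals
        history.append(x[:])
        if len(history) > max_delay:
            history.pop(0)
    return x

n = 8
A = [[4.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
b = [sum(row) for row in A]  # exact solution is all ones
x = async_jacobi(A, b, nprocs=4)
print(max(abs(v - 1.0) for v in x) < 1e-8)  # True
```

Because no step waits for the others, the number of updates each block receives varies with the schedule, which mirrors the varying iteration counts reported above.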
While the Jacobi iteration can be easily adapted as a parallel technique,
it is not guaranteed to converge for general symmetric positive definite
systems. The SOR method is guaranteed to converge for these systems (for any
relaxation factor $0 < \omega < 2$), but is
sequential in nature. To parallelize the successive overrelaxation method for
FEM, the problem must be partitioned in such a way that the system is decou-
pled. A classical method of decoupling is the Red/Black ordering (ref. 13)
for Laplace's equation. This procedure colors the discretization grid in a
checkerboard fashion. Then an SOR sweep can be carried out by two Jacobi
iterations, one on the equations corresponding to the red points, and one on
the equations corresponding to the black points. This strategy does not work
for higher order finite difference or finite element discretizations, however,
because two colors are insufficient to decouple the system. Adams and Ortega
(ref. 7) have developed a new iterative method that they call "Multi-Color"
SOR which is a generalization of the Red/Black ordering. In Multi-Color SOR,
an ordering is imposed on the sequence in which the displacements at the nodes
are calculated, based on the number of colors required for decoupling. For
example, if three colors (red, black, green) are used, the displacements for
each color can be calculated by the processors in parallel. Each iteration of
the algorithm first computes all red values, then all black values, and
finally all green values. This scheme allows SOR to be implemented as a mul-
tiple sweep (one for each color) Jacobi-type method on FEM.
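The multi-color idea can be sketched for any coloring in which no two unknowns of the same color are coupled. Within one color every update reads only values of other colors, so all unknowns of that color could be updated simultaneously, and sweeping the colors in sequence is ordinary SOR with the unknowns reordered by color. The sketch below checks the decoupling condition and then performs color-by-color SOR sweeps. The relaxation factor, tolerance, and the red/black test problem (a five-point Laplacian on a small grid, with a right-hand side chosen so the exact solution is all ones) are assumptions; this is not the implementation of Adams and Ortega (ref. 7).

```python
from typing import List, Sequence

def multicolor_sor(A: List[List[float]], b: List[float],
                   colors: Sequence[int], omega: float = 1.5,
                   tol: float = 1e-10, max_iter: int = 10_000) -> List[float]:
    n = len(b)
    # Decoupling check: same-colored unknowns must not share a nonzero entry.
    for i in range(n):
        for j in range(n):
            if i != j and colors[i] == colors[j] and A[i][j] != 0.0:
                raise ValueError(f"unknowns {i} and {j} share a color but are coupled")
    x = [0.0] * n
    for _ in range(max_iter):
        change = 0.0
        for c in sorted(set(colors)):  # one Jacobi-type sweep per color
            members = [i for i in range(n) if colors[i] == c]
            # These updates are mutually independent (parallel in principle).
            gs = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
                  for i in members]
            for i, g in zip(members, gs):
                new = (1.0 - omega) * x[i] + omega * g
                change = max(change, abs(new - x[i]))
                x[i] = new
        if change < tol:
            return x
    raise RuntimeError("multi-color SOR did not converge")

def laplacian_5pt(m: int) -> List[List[float]]:
    """Five-point Laplacian on an m-by-m grid of unknowns (row-major)."""
    A = [[0.0] * (m * m) for _ in range(m * m)]
    for r in range(m):
        for c in range(m):
            i = r * m + c
            A[i][i] = 4.0
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < m and 0 <= cc < m:
                    A[i][rr * m + cc] = -1.0
    return A

m = 4
A = laplacian_5pt(m)
b = [sum(row) for row in A]  # exact solution: all ones
red_black = [(k // m + k % m) % 2 for k in range(m * m)]
x = multicolor_sor(A, b, red_black)
print(max(abs(v - 1.0) for v in x) < 1e-8)  # True
```

For quadratic or other higher-order discretizations, the same function would simply be given a coloring with more colors (six, for the test problem described next).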
To test this method, Laplace's equation was solved on a square region
discretized by quadratic triangular finite elements for which six colors are
necessary and sufficient to color the discretization. This six-color SOR
algorithm was programmed on a minicomputer to test its convergence properties
and on the four-processor FEM to test its suitability for parallel implementa-
tion. A comparison showed that the problem converged with identical results
on both machines.
A plane stress analysis of a plate was used to compare the Multi-Color
SOR algorithm to the standard conjugate gradient method. The computer program
for this procedure can be used to solve large plate problems by assigning
three structural nodes to each processor or by assigning any multiple of three
nodes to each processor if the number of processors is limited. The
components of the program include the following:
1. Parallel assembly of stiffness matrix K
2. Three-color SOR solution of K u = f
or alternatively,
Conjugate gradient solution of K u = f
3. Parallel stress calculation
The Array can be used to assemble, in parallel, the stiffness matrix from the
problem data without any communication between processors. Linear triangular
finite elements are used to discretize the plate so that three colors are
necessary and sufficient to implement SOR (see ref. 7). The calculation of
the stresses can also be done in parallel without any processor communica-
tion. A more detailed description of the matrix assembly and the stress cal-
culation processes is given in reference 14.
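For comparison, the standard conjugate gradient method listed above can be sketched as follows. The matrix-vector product and the inner products in each iteration are the steps that would be distributed in a parallel version: each processor would form its share of K p, and each inner product would need a global sum, a job the sum/maximum network could plausibly serve. The dense-matrix representation and the test system are assumptions for illustration.

```python
from typing import List, Optional

def dot(u: List[float], v: List[float]) -> float:
    return sum(a * b for a, b in zip(u, v))

def matvec(K: List[List[float]], v: List[float]) -> List[float]:
    return [dot(row, v) for row in K]

def conjugate_gradient(K: List[List[float]], f: List[float],
                       tol: float = 1e-10,
                       max_iter: Optional[int] = None) -> List[float]:
    """Solve K u = f for a symmetric positive definite K."""
    n = len(f)
    max_iter = max_iter or 10 * n
    u = [0.0] * n
    r = f[:]            # residual f - K u, with u = 0 initially
    p = r[:]            # first search direction
    rr = dot(r, r)
    for _ in range(max_iter):
        if rr ** 0.5 < tol:
            return u
        Kp = matvec(K, p)  # the distributed step in a parallel version
        alpha = rr / dot(p, Kp)
        u = [ui + alpha * pj for ui, pj in zip(u, p)]
        r = [ri - alpha * kj for ri, kj in zip(r, Kp)]
        rr_new = dot(r, r)
        beta = rr_new / rr
        p = [ri + beta * pj for ri, pj in zip(r, p)]
        rr = rr_new
    if rr ** 0.5 < tol:
        return u
    raise RuntimeError("conjugate gradient did not converge")

n = 10
K = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
f = [sum(row) for row in K]  # exact solution: all ones
u = conjugate_gradient(K, f)
print(max(abs(v - 1.0) for v in u) < 1e-8)  # True
```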
A comparison of the performance of four processors to one processor on a
plane stress problem with 60 degrees of freedom is given in table 1. These
speedups reflect the execution times of both the solution algorithms and the
underlying system software. The maximum theoretical speedup for a
four-processor system is 4.00. The processor efficiency values given (the
speedup divided by the number of processors) are a measure of the overhead
required for synchronization and communication in the multiprocessor case.
Improved interprocessor communication times should increase
the efficiency of these algorithms on FEM.
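The relation between the two columns of table 1 is efficiency = speedup / number of processors. A small sketch, assuming one-processor and multiprocessor run times are available; the timing values in the last line are invented for illustration and are not FEM measurements.

```python
def speedup(t_one: float, t_many: float) -> float:
    """Ratio of single-processor time to multiprocessor time."""
    return t_one / t_many

def efficiency(s: float, nprocs: int) -> float:
    """Fraction of the ideal (linear) speedup actually achieved."""
    return s / nprocs

# Speedups reported in table 1 for four processors.
table_1 = {
    "Stiffness Matrix Assembly": 3.20,
    "3-Color SOR": 2.84,
    "Conjugate Gradient": 2.82,
    "Stress Calculation": 4.00,
}
for name, s in table_1.items():
    print(f"{name:28s} speedup {s:4.2f}  efficiency {efficiency(s, 4):.1%}")

# Hypothetical timings: 10.0 s on one processor, 3.125 s on four.
print(efficiency(speedup(10.0, 3.125), 4))
```

Computed this way, the conjugate gradient entry comes out at 70.5 percent, which the table rounds to 71 percent.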
FUTURE DIRECTIONS
The solution of the plane stress analysis of a plate on FEM was felt to
be a good starting place to address the issues of parallel matrix assembly,
parallel displacement calculation, and parallel stress calculation. The
experience gained by solving this problem provides a basis for the solution of
more complex structural problems.
Although the initial applications of FEM have been based on iterative
solution approaches, Gannon (ref. 15) has demonstrated that the architecture is
sufficiently flexible to permit direct solution techniques. The study of such
techniques on FEM is a major research area currently being investigated.
In conjunction with algorithm development, alternative processor inter-
connection strategies may also be investigated. To date, the local links have
only been configured in an eight-nearest-neighbor planar mesh topology with
toroidal wrap-around at the edges. This scheme leaves four of the links un-
used. Since the local links can be reconfigured by merely unplugging and
rearranging the interconnecting cables, other topologies such as trees, rings,
perfect shuffles (ref. 16), or cube-connected cycles (ref. 17) are possible.
Because of the relatively large number of links per processor, it would even
be possible to implement multiple interconnection patterns simultaneously
(e.g., eight-neighbor mesh plus shuffle-exchange). The development of
algorithms to make efficient use of alternate topologies is another topic for
research.
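To see why two patterns can coexist on twelve links, the sketch below computes, for a 16-processor array arranged as a 4 x 4 torus, the eight mesh neighbors plus the shuffle, inverse-shuffle, and exchange partners of a perfect-shuffle network (ref. 16), and checks that the union never exceeds twelve links per processor. Numbering the processors 0 to 15 row by row and pairing the mesh with the shuffle-exchange network in this way are assumptions for illustration.

```python
from typing import Set

ROWS = COLS = 4
N = ROWS * COLS           # 16 processors
BITS = 4                  # N = 2**BITS, required by the perfect shuffle
LINKS_PER_PROCESSOR = 12

def mesh8(p: int) -> Set[int]:
    """Eight-nearest-neighbor mesh with toroidal wrap-around."""
    r, c = divmod(p, COLS)
    return {((r + dr) % ROWS) * COLS + (c + dc) % COLS
            for dr in (-1, 0, 1) for dc in (-1, 0, 1) if dr or dc}

def shuffle(p: int) -> int:
    """Perfect shuffle: rotate the BITS-bit processor number left by one."""
    return ((p << 1) | (p >> (BITS - 1))) & (N - 1)

def unshuffle(p: int) -> int:
    """Inverse shuffle: rotate right by one."""
    return ((p >> 1) | ((p & 1) << (BITS - 1))) & (N - 1)

def exchange(p: int) -> int:
    """Exchange: flip the lowest bit."""
    return p ^ 1

def combined_links(p: int) -> Set[int]:
    links = mesh8(p) | {shuffle(p), unshuffle(p), exchange(p)}
    links.discard(p)  # processors 0 and 15 are fixed points of the shuffle
    return links

worst = max(len(combined_links(p)) for p in range(N))
print(worst, worst <= LINKS_PER_PROCESSOR)
```

Some shuffle-exchange partners coincide with mesh neighbors, so the combined pattern needs fewer than eleven distinct links on some processors.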
The current FEM hardware is viewed as an experimental device rather than
as a production machine. Data management considerations and an analysis of
hardware and software performance are expected to point to changes in the
architecture which could be incorporated in a second generation FEM. Such a
machine would undoubtedly benefit from continuing advances in VLSI circuit
technology which would improve performance and reduce the size and cost of
components.
The potential range of FEM applications is not limited to structural
analysis problems, although they provided the original motivation. By design-
ing a parallel architecture thought to be suitable for finite element analy-
sis, a flexible, reconfigurable machine has resulted which can emulate several
distinct computer architectures. FEM could therefore be useful for parallel
algorithms research in a number of disciplines.
CONCLUDING REMARKS
The Finite Element Machine is providing the impetus for development of
new parallel techniques for the solution of structural analysis problems.
Several of these techniques have been implemented and evaluated on the FEM
hardware, thereby demonstrating that parallel solution of structural problems
is a feasible approach. Initial results show that there can be a significant
execution time advantage to be gained by using multiple cooperating proces-
sors. The degree of speedup is dependent on the number of processors, the
algorithms used, and the extent to which the ratio of computation to inter-
processor communication is maximized.
The work done so far has just begun to address the algorithms and appli-
cations suitable for investigation on FEM. Expansion of the machine to 16 and
36 processors will provide a research testbed for the exploration of many
issues relating to parallel processing. It is hoped that the results of this
research can be applied to future production computers which will benefit not
only structural engineers, but other users as well.
REFERENCES
1. Loendorf, David D.: Advanced Computer Concepts for Engineering Analysis
and Design, Ph.D. thesis (in progress), University of Michigan, Ann
Arbor.
2. Jones, A. K.; and Schwarz, P.: Experience Using Multiprocessor
Systems--A Status Report. Computing Surveys, Vol. 12, No. 2, June
1980, pp. 121-165.
3. Rieger, C.; Trigg, R.; and Bane, B.: ZMOB: A New Computing Engine for
AI. University of Maryland, TR-1028, March 1981.
4. Gottlieb, Allan; et al.: The NYU Ultracomputer--Designing a MIMD,
Shared-Memory Parallel Machine. Proceedings of the Ninth Annual
Symposium on Computer Architecture, April 1982, pp. 27-42.
5. Jordan, Harry F.: A Special Purpose Architecture for Finite Element
Analysis. Proceedings of the 1978 International Conference on
Parallel Processing, August 1978, pp. 263-266.
6. Jordan, Harry F.; and Sawyer, Patricia L.: A Multi-Microprocessor
System for Finite Element Structural Analysis. Trends in Computerized
Structural Analysis and Synthesis, A. K. Noor and H. G. McComb, Jr.,
Editors, Pergamon Press, Oxford, 1978, pp. 21-29.
7. Adams, L.; and Ortega, J.: A Multi-Color SOR Method for Parallel
Computation. Accepted for presentation at the 1982 International
Conference on Parallel Processing, August 1982.
8. Jordan, Harry F.; Scalabrin, M.; and Calvert, W.: A Comparison of Three
Types of Multiprocessor Algorithms. Proceedings of the 1979
International Conference on Parallel Processing, August 1979, pp.
231-238.
9. Jordan, Harry F., Ed.: The Finite Element Machine Programmer's
Reference Manual. Computer Systems Design Group, University of
Colorado, Boulder, 1979.
10. Bokhari, Shahid H.: On the Mapping Problem for the Finite Element
Machine. Proceedings of the 1979 International Conference on Parallel
Processing, August 1979, pp. 239-248.
11. Smith, Connie U.; and Loendorf, David D.: Performance Analysis of
Software for an MIMD Computer. Report CS-1982-7, Department of
Computer Science, Duke University, 1982.
12. Baudet, G.: Asynchronous Iterative Methods for Multiprocessors.
Journal of the ACM, Vol. 25, April 1978.
13. Young, D.: Iterative Solution of Large Linear Systems, Academic
Press, 1971.
14. Adams, L.: Iterative Algorithms for Parallel Computers. Ph.D.
Dissertation (in progress), University of Virginia.
15. Gannon, Dennis: A Note on Pipelining a Mesh Connected Multiprocessor
for Finite Element Problems by Nested Dissection. Proceedings of the
1980 International Conference on Parallel Processing, August 1980,
pp. 197-204.
16. Stone, H. S.: Parallel Processing with the Perfect Shuffle. IEEE
Transactions on Computers, Vol. C-20, No. 2, Feb. 1971, pp. 153-161.
17. Preparata, F. P.; and Vuillemin, J.: The Cube-Connected Cycles: A
Versatile Network for Parallel Computation. Communications of the
ACM, Vol. 24, No. 5, May 1981, pp. 300-309.
Table 1. Speedup Ratios for the Plane Stress Problem
on Four Processors vs. One Processor

Algorithm                        Speedup    Processor Efficiency
Stiffness Matrix Assembly         3.20             80%
3-Color SOR (K u = f)             2.84             71%
Conjugate Gradient (K u = f)      2.82             71%
Stress Calculation                4.00            100%
Figure 2.- Finite element machine block diagram.
Figure 3.- Prototype finite element machine hardware.
Figure 4.- Graphic display of nodal displacement. [Legend: original shape; dotted line, deformation after loading.]
Figure 5.- Processor software configuration. [Diagram labels: Controller; neighbors; Nodal Exec; PASLIB subroutine library; diagnostics; system software; application programs.]
Figure 6.- Plane stress analysis of a plate. Panels: (a) plate under load; (b) finite element discretization; (c) six neighboring nodes; (d) four-processor assignment; (e) sixteen-processor assignment; (f) connectivity for four processors; (g) connectivity for sixteen processors.
Figure 7.- (a) Membrane problem ($\nabla^2 u = 0$); (b) plate bending problem ($\nabla^4 u = 0$).




1. Report No.: NASA TM-84514
4. Title and Subtitle: THE FINITE ELEMENT MACHINE: An Experiment in Parallel Processing
5. Report Date: July 1982
6. Performing Organization Code: 505-33-63-01
7. Author(s): O. O. Storaasli and S. W. Peebles, NASA LaRC; T. W. Crockett and J. D. Knott, Kentron; L. Adams, University of Virginia
9. Performing Organization Name and Address: NASA Langley Research Center, Hampton, VA 23665
13. Type of Report and Period Covered: Technical Memorandum




17. Key Words (Suggested by Author(s)): parallel processing
18. Distribution Statement: Unclassified - Unlimited; Subject Category 62
19. Security Classif. (of this report): Unclassified
20. Security Classif. (of this page): Unclassified
21. No. of Pages: 18
22. Price: A02
For sale by the National Technical Information Service, Springfield, Virginia 22161

