A parallel adaptive dynamic relaxation (ADR) algorithm has been developed for nonlinear structural analysis. This algorithm has minimal memory requirements, is easily parallelizable and scalable to many processors, and is generally very reliable and efficient for highly nonlinear problems. Performance evaluations on single-processor computers have shown that the ADR algorithm is reliable and highly vectorizable, and that it is competitive with direct solution methods for the highly nonlinear problems considered.
INTRODUCTION
Use of the finite element method to solve structural problems of increasing computational size and complexity continues to be the focus of intense research. Yet even on current high-speed vector computers, solution costs, especially for transient dynamic analyses, are often prohibitive. Emerging high-performance computers offer tremendous speedup potential for these types of applications, provided an optimal solution strategy is implemented. Existing sequential solution procedures may be adapted to operate on these computers. However, these procedures have been developed and customized for sequential operation and may not be the best approach for parallel processing. To exploit this potential fully, problem formulations and solution strategies need to be re-evaluated in light of their suitability for parallel and vector processing. As such, the overall goal of this research is to develop an adaptive algorithm for predicting the static and dynamic response of nonlinear hyperelastic structures which exploits these emerging high-performance computing systems.

The basic formulation of the adaptive dynamic relaxation (ADR) algorithm for hyperelastic structures is given by Oakley and Knight [1]. Dynamic relaxation is a technique in which the static solution is obtained as the steady-state response of a transient dynamic analysis of an autonomous system. In this case, the transient part of the solution is not of interest; only the steady-state response is desired. Since the transient solution is not needed, fictitious mass and damping matrices which no longer represent the physical system are chosen to accelerate the determination of the steady-state response. These matrices are redefined (using existing equations) so as to produce the most rapid convergence. For highly nonlinear problems where the stiffness changes significantly during the analysis, adaptive techniques exist which automatically update the integration parameters when necessary [6].
An ADR algorithm represents a unified approach for both static and transient dynamic analyses, and is known to be very competitive for certain problems with high nonlinearities and instabilities [6, 7, 8]. Reliability is ensured by integration parameters which are adaptively changed throughout an analysis to accommodate these nonlinear effects. The algorithm is based on an explicit direct time-integration method, and a very small time step is generally required to ensure numerical stability; however, the computational cost per time step is very low and is mostly associated with evaluation of the internal force vector.
The present paper builds on a study which was begun to evaluate the potential of the ADR algorithm for solving complex structural problems on parallel-vector computers. The formulation of an ADR algorithm with application to nonlinear hyperelastic structures is presented in reference [1] including a complete derivation of the algorithm and the problem adaptive scheme used to ensure reliability and improve performance.
Finite element equations are derived for the nonlinear analysis of elastic and hyperelastic solids subject to large deformations. A very simple and efficient algorithm based on solver constraints is developed to enforce frictionless contact conditions. The performance of a sequential implementation of the ADR algorithm is evaluated in reference [2] using a Convex C240 minisupercomputer. A new organization of the finite element computations is implemented to exploit vector processing. Two- and three-dimensional test cases are used to assess the analysis and performance capabilities of the algorithm. Performance of the ADR algorithm for the nonlinear static analysis of each test case is compared with that of an existing finite element code which employs the Newton-Raphson procedure and a highly optimized Cholesky skyline solver. Relative speedups due to vectorization are also presented. This algorithm is found to be reliable and highly vectorizable, and it outperforms the direct solution method for the highly nonlinear problems considered.
The performance of a parallel implementation of the ADR algorithm is evaluated in reference [3] on a cluster of 16 Sun SPARC IPX workstations using PVM [10] and on a 128-processor Intel iPSC/860 hypercube. The parallel implementation is designed such that each processor executes the complete sequential algorithm on a subset of elements. One-dimensional strip partitioning and two-dimensional block partitioning are used to divide the problem domain among the available processors. Load balancing is ensured by using structured meshes of one material. Efficient schemes are implemented to accomplish the required nearest-neighbor and global communication. Relative speedups are presented for the nonlinear static analysis of the 2-D and 3-D test cases developed and analyzed in reference [2]. Impressive relative speedups are achieved on the hypercube and demonstrate the high scalability of the ADR algorithm. Good relative speedups are also achieved using PVM on a cluster of networked workstations.
The objective of this paper is to evaluate the performance of a massively parallel implementation of the ADR algorithm using the 512-processor Intel Touchstone DELTA system. Relative speedups are presented for the test cases evaluated in reference [3] . Final results and completion times for each test case are also given.
The remainder of this paper is organized as follows. In Section 2, the formulation and sequential implementation of the ADR algorithm are reviewed, then the general parallel-processing approach is described.
In Section 3, the Intel Touchstone DELTA system is described, a general flowchart for the complete parallel algorithm is presented, and interprocessor communication details are discussed. In Section 4, the test cases are reviewed and performance results are presented and evaluated. Conclusions are given in Section 5.
PARALLEL-PROCESSING APPROACH
This section begins with a review of the basic formulation and sequential implementation of the ADR algorithm. Afterwards, the general parallel-processing approach is described. The partitioning strategy is presented, load balancing characteristics are discussed, and details of the parallel solution process are given.
Formulation of the ADR Algorithm
The ADR algorithm is based on the following semi-discrete equations of motion governing structural dynamic response at the n-th time increment:

M \ddot{D}^n + C \dot{D}^n + F(D^n) = P^n    (2.1)

where M is a diagonal mass matrix, C is a damping matrix, F is the internal force vector, and P is a vector of external loads. The vectors \ddot{D}, \dot{D}, and D represent the acceleration, velocity, and displacement vectors, respectively. The internal force vector F is a function of the displacements and may be assembled on an element-by-element basis [1].
The ADR algorithm involves the use of an explicit numerical time-integration technique to solve equation (2.1). In the current algorithm, a half-station, central-difference technique is used to approximate the temporal derivatives, with the fictitious parameters chosen to produce the most rapid convergence. Convergence herein is based on the relative error of the force imbalance,

\epsilon^n = ||P - F^n|| / ||P|| = ||R^n|| / ||P||    (2.7)

where P is the static load and R^n is the residual force vector for time step n. Note that when convergence (i.e., steady-state response) is obtained, the internal forces balance the external forces, and the inertial forces vanish.
A derivation of the fictitious mass and damping equations for the ADR algorithm is given in reference [1]. Based on Gerschgorin's theorem, the fictitious mass matrix is defined as

M_{ii} \ge (h^2 / 4) \sum_j |k_{ij}|    (2.8)

The quantities k_{ij} in this expression correspond to the entries in the element stiffness matrix k_e (see reference [1]). As shown in equation (2.8), the fictitious mass M and the time step size h are not independent. However, once either is specified, the other value may be readily computed. Herein, the time step size is arbitrarily set to one. Numerical experiments with other values (e.g., 10, 100, 1000) confirm that this choice is arbitrary; no change in algorithm performance or convergence characteristics is observed. The new damping coefficient c is computed using

c = 2 \sqrt{\lambda_0}

The parameter \lambda_0 represents the lowest eigenvalue, estimated from the mass-stiffness Rayleigh quotient

\lambda_0 \approx (D^T S^n D) / (D^T M D)

The vector S^n is a diagonal estimator of the directional stiffness after step n, the components of which are obtained from the change in internal force per unit change in displacement at each degree of freedom over the preceding time step [1].
The new damping coefficient c is updated every time step since it only involves minimal computations.
The new mass matrix must be evaluated at the beginning of the analysis, and subsequent updates are sometimes necessary to maintain numerical stability. Stability is monitored using the perturbed apparent-frequency error indicator of equation (2.12), in which values greater than one represent potentially unstable numerical conditions; the derivation of this indicator is given in reference [6].
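To make the roles of equation (2.8) and the adaptive damping update concrete, the following minimal sketch (Python/NumPy, not the code used in this study) computes the fictitious parameters from assembled diagonal quantities; the function names, arguments, and the optional safety factor are illustrative assumptions only.

```python
import numpy as np

def fictitious_mass(k_row_abs_sum, h=1.0, safety=1.0):
    """Diagonal fictitious mass from Gerschgorin's theorem, equation (2.8):
    M_ii >= (h**2 / 4) * sum_j |k_ij|.  `safety` >= 1 adds stability margin."""
    return safety * (h ** 2 / 4.0) * k_row_abs_sum

def adaptive_damping(d, s_diag, m_diag):
    """Damping coefficient c = 2*sqrt(lambda_0), with lambda_0 estimated from
    the mass-stiffness Rayleigh quotient (d^T S d) / (d^T M d); S and M are
    diagonal matrices stored here as vectors."""
    a = np.dot(d, s_diag * d)
    b = np.dot(d, m_diag * d)
    return 2.0 * np.sqrt(max(a, 0.0) / b) if b > 0.0 else 0.0
```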
Solution Process
A flowchart of the sequential implementation of the ADR algorithm is shown in Figure 1. Contact conditions are also enforced within the solution loop using the solver-constraints technique described in reference [1]. If unstable conditions exist, as determined from equation (2.12), the flag istif is reset to 1. Phase 4 evaluates the force imbalance and convergence.
The scalar quantity ||R|| is computed and then the error is determined using equation (2.7). The scalar quantity ||P|| in equation (2.7) is constant for the test cases considered herein and is evaluated once at the beginning of the analysis.
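The overall structure of the sequential solution process can be summarized by the sketch below. It is only an illustration of a generic dynamic-relaxation loop using the half-station central-difference update commonly employed for dynamic relaxation; the helpers internal_force and update_damping are hypothetical stand-ins for the element-level routines, and the sketch omits the fictitious-mass updates, stability checks, and contact enforcement shown in Figure 1.

```python
import numpy as np

def adr_loop(d, m_diag, p, internal_force, update_damping,
             h=1.0, tol=1.0e-4, max_steps=200000):
    """Relax displacements d until the force-imbalance error of eq. (2.7)
    falls below tol.  m_diag is the diagonal fictitious mass as a vector."""
    v = np.zeros_like(d)             # half-station velocity (simple zero start)
    p_norm = np.linalg.norm(p)       # ||P|| is constant for a static load
    for step in range(max_steps):
        f = internal_force(d)        # element-by-element assembly (Phase 1)
        r = p - f                    # residual force vector R^n
        if np.linalg.norm(r) / p_norm < tol:     # convergence test (Phase 4)
            return d, step
        c = update_damping(d, v, f)  # adaptive damping updated every step
        v = ((2.0 - c * h) * v + 2.0 * h * (r / m_diag)) / (2.0 + c * h)
        d = d + h * v                # explicit displacement update
    raise RuntimeError("ADR loop did not converge within max_steps")
```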
Parallel Implementation
The ADR algorithm is very amenable to parallel processing. All vector quantities needed for the solution step may be computed on an element-by-element basis. The calculations required for each element are completely independent, and the system of equations for computing the displacement solution at the next time step is completely uncoupled. The primary objective in developing a parallel implementation is to maximize performance (in terms of relative speedup) by minimizing the frequency and extent of interprocessor communication and by maximizing load balance. To accomplish this objective, the ADR algorithm is parallelized by having each processor execute the complete sequential algorithm on a subset of elements.
Two different partitioning schemes are used to divide the problem domain among the available processors. The first is referred to as one-dimensional strip partitioning and views the system topology as a 1-D array of m processors. As described in reference [9], the finite element mesh is partitioned along the longest element-wise dimension into m strips, and the elements within each strip are assigned to a particular processor. The strip partitions of a cantilever beam test case are illustrated in Figure 2a for the case of 16 processors. The 1-D array view of the system topology is shown in Figure 2b, and the corresponding processor assignments for each partition are indicated. The test cases used in this research are very amenable to this type of partitioning as they all have one dominant dimension in terms of the number of elements in one direction. This method is relatively simple to implement and, as shown in Figure 2a, limits the neighbors of each partition to two.
One-dimensional strip partitioning is only feasible if the maximum number of available processors is less than or equal to the number of elements along the longest element-wise dimension. If a large number of processors are used such that this condition is no longer satisfied, then a second partitioning scheme, referred to as two-dimensional block partitioning, becomes necessary. This method views the system topology as an n x m array of processors, where n is the number of rows of processors and m is the number of processors per row. As described in reference [9] , the finite element mesh is first partitioned along the longest element-wise dimension into n strips. Each strip is then partitioned along its length into m blocks, and the elements within each block are assigned to a particular processor. The block partitions of a cantilever beam test case are illustrated in Figure 3a for the case of 16 processors. In this example, the system topology is viewed as a 4x4 array of processors as shown in Figure 3b . It can be seen from Figure 3a that with this method, each partition has at most eight surrounding neighbors.
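As a concrete (and purely illustrative) view of the two schemes, the sketch below assigns the elements of a structured nex-by-ney grid to processor ranks. The grid indexing, the assumption that x is the longest element-wise dimension, and the row-major rank convention are assumptions made for this example, not details taken from the paper.

```python
def strip_rank(ix, nex, m):
    """1-D strip partitioning: the longest (x) element-wise dimension is
    split into m contiguous strips; element column ix belongs to one strip."""
    return ix * m // nex

def block_rank(ix, iy, nex, ney, n, m):
    """2-D block partitioning: n strips along x, each cut into m blocks
    along y; the result is a rank in an n-by-m processor array (row-major)."""
    row = ix * n // nex
    col = iy * m // ney
    return row * m + col
```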
Balancing the workload among the processors of a concurrent computer is central to achieving high performance. Overall relative speedup is generally limited by the maximum CPU time required for any processor to complete its assigned work. Since identical processors are typically available on a given parallel-processing system, the amount of computation or work assigned to each processor needs to be nearly the same in order to avoid idle processors. As noted in Section 1, only structured meshes of uniform element size, type, and material are considered herein. Therefore, load balancing is primarily a function of the number of elements assigned to each processor, and the partitioning is performed so that each processor receives the same number of elements. A more sophisticated mapping or partitioning strategy (e.g., domain decomposition) would have to be employed to achieve sufficient load balancing with an unstructured mesh or if different materials are involved in the same finite element model.
A flowchart of the solution procedure executed on a given processor is shown in Figure 4 . All computations are now for the elements assigned locally to that processor. As shown, the algorithm is very similar to the sequential solution procedure given in Figure 1 with just a few important differences related to interprocessor communication.
Nearest-Neighbor Communication
One nearest-neighbor communication sequence is required for each time step (in Phase 1 of Figure 4 ).
This may be illustrated in reference to Figures 2 and 3 which show the finite element mesh for a cantilever beam partitioned among 16 processors. Considering any two adjacent partitions, nodes along the interface are shared by elements that are allocated to two different processors. As a result, the internal force vector F and lumped stiffness matrix S computed for the interface nodes on a given processor are only partially complete. Contributions must be obtained from the neighboring processors before the solution process can continue.
For 1-D strip-partitioned domains, each partition may have a left and right neighbor. The required nearest-neighbor communication can be accomplished in two steps as follows. In the first step, each processor sends F and S values for the left interface to its left neighbor and then receives the corresponding values being sent from its right neighbor as shown in Figure 5a . The second step consists of repeating this process in an analogous fashion for the right interface nodes (see Figure 5b ). The end result is an exchange of the interface values between neighboring processors. This approach synchronizes the communication sequence among the processors, thereby improving efficiency and avoiding the possibility of deadlock (e.g., two processors sending different variables simultaneously, and each waiting for the other's variable to be sent before proceeding).
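A minimal message-passing sketch of this two-step exchange is shown below, written with mpi4py purely for illustration; it is not the authors' implementation. The buffer layout, the helper name, and the use of MPI.PROC_NULL for a missing neighbor are assumptions of the sketch; receiving from a null neighbor simply leaves the zero-initialized buffer, i.e., no contribution from that side.

```python
import numpy as np
from mpi4py import MPI

def exchange_strip_interfaces(comm, left_send, right_send):
    """Two-step exchange of partial F and S interface values with the left
    and right neighbors of a 1-D strip partition."""
    rank, size = comm.Get_rank(), comm.Get_size()
    left = rank - 1 if rank > 0 else MPI.PROC_NULL
    right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

    left_recv = np.zeros_like(left_send)    # stays zero if there is no left neighbor
    right_recv = np.zeros_like(right_send)  # stays zero if there is no right neighbor

    # Step 1: send left-interface values to the left neighbor, receive the
    # right neighbor's values (which belong to our right interface).
    comm.Sendrecv(sendbuf=left_send, dest=left, recvbuf=right_recv, source=right)
    # Step 2: the analogous exchange for the right interface.
    comm.Sendrecv(sendbuf=right_send, dest=right, recvbuf=left_recv, source=left)
    return left_recv, right_recv
```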
For 2-D block-partitioned domains, each partition may have eight surrounding neighbors. Even so, the nearest-neighbor communication can be accomplished in four steps [4]. First, the left and right interface values are exchanged in the horizontal direction as shown in Figure 6a. This is accomplished using the two-step sequence described earlier for strip-partitioned domains. After each processor has updated its left and right interface values with contributions received from its horizontal neighbors, the upper and lower interface values are exchanged in the vertical direction using the same two-step sequence as before (see Figure 6b).

Global Communication

Two global communication sequences are required during each time step. The first is needed to compute the damping coefficient c, which depends on the global vector products

D^T S D = \sum_{p=1}^{np} A_p,  with  A_p = \sum_{i=1}^{ndof} D_i S_i D_i    (2.13)

D^T M D = \sum_{p=1}^{np} B_p,  with  B_p = \sum_{i=1}^{ndof} D_i M_i D_i    (2.14)

where np refers to the number of processors and ndof denotes the degrees of freedom associated with the elements on a given processor (it is understood that contributions from the interface degrees of freedom are included only once).
Each processor must use the same value of the damping coefficient c, and this value must be the same as that computed in the sequential solution (i.e., it must be based on the global D, S, and M vectors). To accomplish this, the following strategy is used. Each processor computes the local vector products A and B given by equations (2.13) and (2.14), respectively. A global sum of A and B is then performed across all processors such that they all end up with the same complete values for D^T S D and D^T M D. The correct damping coefficient may then be computed and utilized by each processor.
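This strategy can be sketched as a single all-reduce of the two local scalars, again using mpi4py only as an illustrative stand-in; the local vectors are assumed to be stored so that each interface degree of freedom is counted on only one processor.

```python
import numpy as np
from mpi4py import MPI

def global_damping(comm, d_local, s_local, m_local):
    """Local A = d^T S d and B = d^T M d are summed across all processors so
    that every node computes the identical damping coefficient."""
    local = np.array([np.dot(d_local, s_local * d_local),   # A, equation (2.13)
                      np.dot(d_local, m_local * d_local)])  # B, equation (2.14)
    total = np.zeros_like(local)
    comm.Allreduce(local, total, op=MPI.SUM)                 # global sum
    a, b = total
    return 2.0 * np.sqrt(max(a, 0.0) / b) if b > 0.0 else 0.0
```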
A second global communication sequence is needed after computation of the new local displacements (i.e., during Phase 4). Several events must occur at this point as discussed for the sequential implementation.
Convergence must be evaluated based on the current residual force imbalance error, and flags must be set to control continuation of the solution process and updates of the fictitious mass.
The error computation given by equation (2.7) needed to assess convergence is similar to that of the damping coefficient. It must be based on the global system response, and the final result is needed by each processor. In this case, the same global summation procedure is used to supply each processor with complete, identical values for the vector product R^T R. As mentioned earlier, the quantity P^T P is constant for static analysis; therefore, it is computed once and provided as input for each processor.
The flags tend and istif represent local conditions on each processor; however, the actions they initiate must be performed by all processors. If one processor detects the need for a fictitious mass update, all processors must perform this update. Similarly, if collapsing elements occur on one processor, all processors must exit the solution process. This information is also communicated by a global sum procedure. The appropriate actions are then taken by each processor if the final summation result for each flag is greater than or equal to one.
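A sketch of this second global sequence, combining the residual norm and the two flags into a single reduction, is given below. The flag names follow the text; everything else (the helper name, argument layout, and packing of the three values into one buffer) is an illustrative assumption.

```python
import numpy as np
from mpi4py import MPI

def global_convergence(comm, r_local, p_norm_sq, tend_local, istif_local):
    """Return the global error of equation (2.7) plus global stop/update flags."""
    local = np.array([np.dot(r_local, r_local),   # local contribution to R^T R
                      float(tend_local),          # 1 if this processor wants to stop
                      float(istif_local)])        # 1 if a fictitious-mass update is needed
    total = np.zeros_like(local)
    comm.Allreduce(local, total, op=MPI.SUM)
    error = np.sqrt(total[0] / p_norm_sq)         # ||R|| / ||P||
    stop = total[1] >= 1.0                        # any processor set tend
    update_mass = total[2] >= 1.0                 # any processor set istif
    return error, stop, update_mass
```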
Summary
In summary, the primary difference between the sequential and parallel solution process is that the parallel algorithm requires one nearest-neighbor and two global communication sequences each time step.
For a given partitioning scheme, the nearest-neighbor communication is independent of the number of processors but is dependent on problem size, since larger problems imply more interface nodes. The global communication is independent of problem size, since it only involves scalar quantities, but is dependent on the number of processors. For large problems, the global communication time should represent only a small fraction of the total communication cost. Additional implementation details for nearest-neighbor and global communication are topology dependent and are discussed in Section 3.
PARALLEL ALGORITHM
The parallel ADR algorithm is developed for implementation on the Intel Touchstone DELTA system referred to herein as simply the DELTA. The unvectorized code is used as the baseline code. Vectorization can be accomplished on the DELTA; however, it is not as straightforward to implement as on a Convex computer [2] . Thus, the effect of vectorization on parallel performance will not be investigated in this paper.
This section begins with a description of the DELTA. Afterwards, a general description of the complete parallel algorithm is given and communication details are discussed.
Intel Touchstone DELTA System
The Intel Touchstone DELTA system is a multiple-instruction multiple-data (MIMD) type computer.
The processor interconnection network represents a two-dimensional mesh topology rather than a hypercube topology and is illustrated for 12 processors in Figure 7 . This architecture is advantageous for large finite element computations in two ways. First, it is scalable (i.e., its communication performance does not degrade as the number of processors is increased). This feature is important as the size of finite element problems continues to increase, requiring more and more processors to achieve results in a reasonable time frame.
Secondly, its use of local memory alleviates many of the problems and losses in efficiency associated with memory contention on shared-memory computers.
The DELTA system consists of a total of 512 numeric processors. Each processor is an Intel i860 processor with a 40 MHz clock speed and 16 megabytes of local nonshared memory. A peak performance of 60 MFLOPS per processor is attainable for double-precision floating-point computations. A diagram of the system is shown in Figure 8. The parallel processors are configured in a 16 x 32 array with a two-dimensional mesh interconnection network. In contrast to the hypercube interconnection topology, in which each processor is directly connected to n neighbors for an n-dimensional cube, with the mesh topology of the DELTA each processor is directly connected to at most four neighbors regardless of the total number of processors being used (see Figure 7). This results in a lower interprocessor connectivity than the hypercube topology.
Interprocessor communication is accomplished with mesh routing chips which enable messages to go directly to the receiving processor without interrupting any of the others. The mesh routing chips are designed to be faster than the routing modules of the Intel iPSC/860 hypercube, in order to make up for the additional distance messages may have to travel due to the lower connectivity of the mesh topology. The DELTA system has two gateway processors, Delta1 and Delta2, which act as interfaces between the parallel processors and the Ethernet network. These two systems may be viewed as two Unix systems on the network, and they serve as a base from which applications may be run on the system. The system is networked to remote workstations which are used for editing, compiling, and linking the parallel algorithms, and also for any pre- and post-processing.
Complete Parallel Algorithm
A general flowchart for the complete parallel algorithm is shown in Figure 9. It consists of sequential pre- and post-processing programs which run on a remote workstation and a node program which runs on each of the parallel processors. The pre-processing program reads the input data required for the analysis and prepares this data for the parallel solution process. The input data is first mapped into the appropriate arrays needed for finite element computations. The resulting information is then partitioned and written out in the form of an input file for each processor. These files are transferred to one of the two gateway processors, Delta1 or Delta2. From the gateway processor, an n x m array of parallel processors is allocated, and the node program is loaded and started on each. On a given processor, the node program reads its input data file and then cycles through the ADR solution process for its subdomain of elements. When the solution process is complete, each processor writes its local results to a file. These result files are then transferred back to the remote workstation, and the post-processing program is executed. The post-processing program reads the result files for each processor, assembles the global results, and then writes these results to a file.
Nearest-Neighbor Communication
Depending on the partitioning method being employed, nearest-neighbor communication is accomplished using one of the two message-passing sequences described in Section 2. To implement this communication effectively on the DELTA, the domain-to-processor mapping must account for the number of rows and columns of processors to be utilized and the associated numbering scheme. The DELTA permits allocation of an n x m array of processors, where n is the number of rows of processors (1 ≤ n ≤ 16) and m is the number of processors per row (1 ≤ m ≤ 32). The numbering scheme is illustrated in Figure 7. As shown, processors are numbered sequentially from left to right in each row, starting at the upper left-hand corner of the n x m array.
For 1-D strip-partitioned domains, the mapping strategy may be illustrated by considering a cantilever beam to be analyzed using 16 processors. If a single row of 16 processors is allocated as shown in Figure 2b, the processors are assigned to each strip in consecutive order as illustrated in Figure 2a. To illustrate the mapping strategy employed in conjunction with 2-D block partitioning, consider again a cantilever beam to be analyzed using 16 processors. If a 4 x 4 array of processors is allocated (see Figure 3a), the beam is normally partitioned into a 4 x 4 array of blocks, and processors are assigned as indicated in Figure 3b.
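The row-major numbering described above can be captured in a few lines; the helper names and the None return for an off-array neighbor are assumptions of this sketch.

```python
def rank_to_row_col(rank, m):
    """Invert the DELTA numbering: processors are numbered left to right
    within each row, so rank = row * m + col."""
    return rank // m, rank % m

def neighbor_rank(row, col, drow, dcol, n, m):
    """Rank of a neighboring block in the n x m processor array, or None if
    the neighbor would fall outside the array (mesh boundary)."""
    r, c = row + drow, col + dcol
    if 0 <= r < n and 0 <= c < m:
        return r * m + c
    return None
```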
Global Communication
As already mentioned, two stages in the ADR algorithm require global sums, representing the addition of partial sums located on each processor. This operation may be easily and efficiently accomplished by using the high-level global-sum construct which the DELTA provides. This construct takes advantage of the two physical links between connected processors to overlap communication in two directions. For an n x m array of 2^p processors, its application yields the complete sum on each processor after only p communication steps.
The global sum process may be illustrated in reference to the 2 x 2 mesh shown in Figure 12. In the first step, partial sums are simultaneously exchanged between processors P0 and P1, and between processors P2 and P3. The second step is analogous to the first except that results are now exchanged between processors P0 and P2, and between processors P1 and P3. Thus, after two communication steps the complete sum is present on all four processors.
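The pattern generalizes to 2^p processors as a recursive-doubling sum, sketched below. In practice the DELTA's global-sum construct (or an MPI all-reduce) performs this internally, so the explicit loop is shown only to make the p-step count visible; it is an illustration, not the library's implementation.

```python
import numpy as np
from mpi4py import MPI

def recursive_doubling_sum(comm, value):
    """Global sum over 2**p processors in p pairwise-exchange steps."""
    rank, size = comm.Get_rank(), comm.Get_size()  # size assumed to be a power of two
    total = np.array([value], dtype=float)
    step = 1
    while step < size:                             # log2(size) = p iterations
        partner = rank ^ step                      # pair off by flipping one bit
        recv = np.zeros_like(total)
        comm.Sendrecv(sendbuf=total, dest=partner, recvbuf=recv, source=partner)
        total += recv                              # accumulate the partner's partial sum
        step <<= 1
    return float(total[0])
```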
NUMERICAL RESULTS
First, a description of the test cases used for evaluating performance is given. Next, an overview of the evaluation procedure is provided. The primary factors which affect performance are then reviewed. Finally, parallel-processing results for the DELTA are discussed.
Test Cases
Thirteen test cases representing both straight and curved 2-D and 3-D geometries with elastic and hyperelastic materials are used to evaluate performance. They were developed and analyzed in reference [2] and are intended to represent some of the problems which occur in tire modeling and analysis. As such, they are designed to include contact, large deformations, and nonlinear hyperelastic materials. The key features of each test case are summarized in Table 1 . Additional modeling details, including contact conditions, may be found in reference [2] .
The straight test cases correspond to the 2-D plane stress and 3-D analysis of elastic and hyperelastic cantilever beams subjected to tip loading and frictionless contact with an inclined surface (see Figure 13 ).
Two discretizations are considered for each problem: one with 1024 elements and another with 8192 elements. The discretization is uniform such that all elements in a given mesh are the same size. Figure 14 shows the 8192-element test case of the 3-D hyperelastic beam in its final "steady-state" deformed configuration.
The curved test cases represent a 3-D hyperelastic circular arch, a 3-D elastic cylindrical thick shell or "tunnel," and a 3-D elastic and hyperelastic torus subjected to a line load at the summit (see Figure 15).
Similar to the straight test cases, two discretizations of the arch are considered, consisting of 128 and 1024 8-node solid elements. Figure 16 shows the 1024-element circular arch in its deformed state. The tunnel is shown in Figures 17a and 17b in its undeformed and deformed configurations, respectively. The deformed configuration of the torus test case when loaded against a flat surface is shown in Figure 18.

In preface to discussing the parallel performance of the ADR algorithm, the interacting factors which govern this performance should be considered. A review of the primary factors is given next.
Performance Factors
Best performance (in terms of relative speedup) is achieved when the ratio of communication time to computation time is minimized, and when the computational load among the processors is balanced.
Computation Time
The material model and the dimensionality of the elements affect computation time. Hyperelastic elements are more computationally demanding than elastic elements. For a given mesh, the hyperelastic test cases should therefore lead to higher relative speedups than the elastic test cases due to increased computation relative to communication. Trilinear elements require more computations than bilinear elements; however, they also require more communication since they have four nodes (as opposed to two) on each face.
The number of processors affects computation time. As the number of processors increases for a given problem, the number of elements (and therefore the computational load) per processor decreases. This decrease is mostly linear, although there is a small amount of computational cost associated with adding more processors, due to an increase in redundant computations. Since each processor executes the complete ADR algorithm, the kinematic variables (displacements, velocities, and accelerations) for the nodes on a given interface are computed by both processors sharing that interface. This redundant computational effort is increased as more processors (and therefore more interfaces) are added.
Communication Time
The time for nearest-neighbor communication is only a function of message length, which is defined for the ADR algorithm by the number of nodes on a given interface and the degrees of freedom associated with each node. For 1-D strip partitioning, the message length is mesh dependent and remains fixed regardless of the number of processors added. However, for 2-D block partitioning, the message length depends on both the mesh and the number of processors utilized. As described earlier, only two send-receive operations (in the case of 1-D strip partitioning) or four send-receive operations (in the case of 2-D block partitioning) are needed to implement all nearest-neighbor exchanges in each time step, regardless of the mesh or number of processors. The time for global communication is a function of the number of processors. As mentioned previously, for an n x m array of 2^p processors, the number of associated messages is equal to p. These messages only involve scalar quantities, and the message lengths are independent of problem size.
Load Balance
Relative speedup may be reduced due to load imbalances in which one or more processors are waiting on others to finish. For the applications considered here, load balancing is primarily a function of the number of elements assigned to each processor as discussed in Section 2. However, some degree of load imbalance may be introduced from the following two sources. Processors with elements which form the boundaries of the given mesh in the XY plane (elements along the left and right ends or top and bottom surfaces of the cantilever beams, for example) have fewer neighboring processors, and therefore perform less of the nearest-neighbor communication process. In addition, the contact enforcement algorithm represents additional computations for some processors.
DELTA Results
Parallel performance results for the DELTA are presented in Table 2, and plots of relative speedup versus number of processors are shown for each test case in Figure 19. For each test case, results were obtained using an increasing number of processors (i.e., a single processor and 16, 32, 64, 128, 256, and 512 processors) to generate the data in Figure 19. Two-dimensional block partitioning was used in all cases; therefore, the total processor numbers given are expressed in the n x m format, where n denotes the number of rows of processors and m denotes the number of processors per row. The relative speedups shown are based on a total of 512 processors, with five exceptions. The small 3-D arch test case is limited to 64 processors when 2-D block partitioning is used, since it represents a 2 x 32 element mesh in the XY plane. Likewise, the small 3-D beam test cases and the large 3-D arch and tunnel test cases represent 4 x 64 element meshes in the XY plane and are therefore limited to 256 processors. The message lengths shown in Table 2 refer to the total number of internal force values each processor must exchange for nearest-neighbor communication each time step. The number of time steps or iterations executed is also indicated. Some general trends are now discussed.
The small 2-D beam test cases and the 3-D 128-element arch test case exhibit the lowest relative speedups. These test cases have a small number of elements and therefore do not entail significant computational effort. As a result, even though the message lengths are short, the overall communication-to-computation ratios are relatively high. The relative speedup achieved for the arch test case using 64 processors represents a parallel-processing efficiency of 39 percent. That is, the relative speedup is 39 percent of the theoretical or linear-relative-speedup limit of 64. The beam test cases exhibit an average efficiency of 50 percent for 64 processors, 18 percent for 256 processors, and 9 percent for 512 processors. Accordingly, scalability beyond 64 processors is very low for these test cases, as reflected by the slope of the relative speedup curves in Figure 19. The potential exists for even higher relative speedups beyond 512 processors, although the parallel-processing efficiency measure may become small in this range.
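For reference, the efficiency figures quoted here are simply the measured relative speedup divided by the ideal (linear) speedup for the processor count, as in the small illustrative calculation below; the numeric value shown is implied by the 39 percent figure above, not a separately reported measurement.

```python
def parallel_efficiency(relative_speedup, num_processors):
    """Efficiency = relative speedup / linear-speedup limit."""
    return relative_speedup / num_processors

# Example: a relative speedup of about 25 on 64 processors gives
# parallel_efficiency(25.0, 64) ~ 0.39, i.e. roughly the 39 percent quoted above.
```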
The highest relative speedups are exhibited by the large 3-D beam, tunnel, and torus test cases. These test cases have the lowest communication-to-computation ratios because of the large number of elements and the increased computational cost associated with the trilinear element. The two cantilever beam test cases achieve the highest relative speedups within this group since they have the shortest message lengths.
The relative speedups achieved for all five of these test cases represent an average efficiency of 84 percent for 256 processors. For 512 processors, the beam and torus test cases exhibit an average efficiency of 75 percent. Thus, the scalability of the ADR algorithm for these test cases remains high through 512 processors (see Figure 19), and efficient execution on many more processors (leading to much higher relative speedups) should be possible.
As expected, the hyperelastic versions of each test case exhibit higher relative speedups than their elastic counterparts. The hyperelastic test cases require more computation per element because of the material model and, as such, achieve better parallel performance. The effect of the partitioning method on performance has also been investigated. Relative speedup results for several different partition or processor arrangements are presented in Table 3 for the large 3-D elastic beam. As shown, the effect is minimal (less than three percent).
All thirteen test cases were run to completion on the DELTA in order to demonstrate correctness of the parallel implementation of the ADR algorithm. The nonlinear static analysis results are given in Table 4 .
As shown, each test case was analyzed using the maximum number of processors allowed with the present 2-D block partitioning scheme. The number of time steps (or iterations) is given, as is the elapsed time to completion. The Y values are maximum vertical displacements for the free end of the beam and for the summit of the arch, tunnel, and torus. The X values are the corresponding horizontal displacements for the free end of the beam. The results shown are consistent with those of the single-processor implementation [2] . For some of the test cases involving contact, differences exist in the number of iterations and small differences exist in the final displacements when compared with the results given in reference [2] . This is because all aspects of the final contact formulations given in reference [1] were not in place at the time the results in reference [2] were generated.
The completion times for all of the test cases are less than one hour and demonstrate both the computing power of the DELTA and the ability of the ADR algorithm to exploit this capability fully. As mentioned earlier, these results were achieved using unvectorized, baseline code. As shown in reference [2], a vectorized version could further increase the speed of execution by a factor on the order of five, assuming that the problem size is such that each processor has enough elements to achieve efficient vectorization.
CONCLUSIONS
The overall goal of this research is to develop efficient single-processor and multiprocessor implementations of the ADR algorithm and evaluate their performance for the static analysis of nonlinear, hyperelastic systems involving frictionless contact. For problems of this nature, the ADR algorithm may represent one of the best approaches for parallel processing. Performance evaluations on single-processor computers have shown that the ADR algorithm is reliable and highly vectorizable, and that it is competitive with direct solution methods for the highly nonlinear problems considered. In contrast to direct solution methods, it has minimal memory requirements, is easily parallelizable, and is scalable to large numbers of processors. It also avoids the ill-conditioning related convergence problems of other iterative methods for nonlinear problems. The objective of the present paper is to evaluate the performance of a massively parallel implementation of the ADR algorithm.
A parallel ADR algorithm is developed for nonlinear structural analysis and implemented on the 512-processor Intel Touchstone DELTA system. It is based on the sequential code described in reference [2] and is designed such that each processor executes the complete sequential algorithm on a subset of elements. One-dimensional strip partitioning and two-dimensional block partitioning are used to divide the problem domain among the available processors. Load balancing is ensured by the use of structured, uni-material meshes.
