As product designs become more sophisticated, both the finite element (FE) and the design optimization (DO) models have grown bigger. At the same time there is increasing evidence that computing clusters created with commodity chips are capable of outperforming traditional supercomputers. In this paper, the HYI-3D design optimization software system is discussed. HYI-3D has been developed to work in sequential and distributed processing environments. The two major objectives are as follows -(a) implement the FE methodology using the well-known domain decomposition technique where the original FE model is split into a number of smaller subdomains, and (b) implement a design optimization methodology for sizing, shape and topology optimization using coarse and fine-grain parallelism. For the FE engine, a direct sparse solver is used at the subdomain level and preconditioned conjugate gradient (PCG) is used at the interface system level. The finite element equations are then generated and assembled at the individual domain level in parallel. Matrix and vector operations involving sparse matrices form the bulk of the computations in this step. Once these equations are assembled, the condensed system level equations are formed. These condensed system level equations are usually much smaller (but denser) than the original system equations, and hence can be computationally expensive. With respect to design optimization, multi-level parallelism is employed. Not only can the finite element analysis be carried out in parallel but also other steps in the design optimization algorithm can be computed in parallel -gradients, line search and direction-finding problem. Numerical examples show the gains obtained from coarse and fine grain parallelism.
Introduction
Finite-element based design optimization is now a relatively well-established technique for engineering design. As finite element models have become more sophisticated and detailed, the execution time has also increased in spite of advances in hardware technology. It is interesting to note that users of finite element software have access to almost every single type of computer hardware and operating system. Typical design engineers now work in an engineering computing environment that is increasingly varied and heterogeneous. These include 32-bit desktop systems, 64-bit workstations, cluster of 32-bit machines, cluster of 64-bit machines, 32-bit and 64-bit machines with multiple processors in a shared memory mode as well as supercomputers with several processors and shared memory. The operating systems are also varied -Microsoft Windows, the various flavors of Linux and Unix, and others.
A typical design engineer works with a 32-bit or 64-bit desktop system. This system is used primarily for (a) pre and post-processing, (b) for design problems that can be executed on these systems in a reasonable amount of time, and (c) as a gateway to faster computing clusters. When the design engineer, needs to solve bigger problems, different type of parallel computing architectures can be used. The software development challenge is to design a system that will scale with available hardware. To make the most of the system, one should know the topology of the system and the desired topology for the application's threads and/or processes.
II. Parallel Processing Issues
Two widely used approaches in parallelizing computations involve the use of threads (usually on shared memory systems) and some form of message-passing (usually on distributed memory systems). There have been successful efforts at using both threads in the form of OpenMP directives and message-passing in the form of MPI calls to create high-performance software systems. OpenMP is used primarily with shared memory systems while MPI is the choice for distributed memory systems. MPI implementations use TCP/IP protocol or with custom switching hardware, proprietary protocol such as Myrinet etc. for increased bandwidth. Software development under this scenario offers challenges that are unique to parallel computations.
A. Hardware Issues
Computer systems with multiple processors have been around a long time especially in the form of mainframes and supercomputers. However, in the last decade there has been concerted effort to use commonly available CPUs in a variety of parallel processing architectures. Today, the three most popular architectures are clusters, constellations and massively parallel processing (MPP) systems. Seminal parameters of parallel systems include performance, parallelism, control, latency management, namespace distribution and reliance on commodity 7 . The implication for software developers is that (a) hardware evolution will take place at a very rapid pace for the foreseeable future, and (b) they must protect their investment with software standards that should evolve transparently with the changes in hardware.
B. Software Issues
In comparison to the development of (faster) hardware, little effort has been invested in the development and ready availability of software tools. There are two primary software-related issues.
Development and maintenance cost: The development and maintenance cost of any software system is a function of a number of parameters. In any case, there are two parameters that software developers need to pay particular attention to. First is the choice of programming language. Second, is the use of libraries (or function calls) for inter-process communication.
Load balancing and scalability: Any parallel algorithm needs to address the issues of load balancing and scalability. It is a challenge to develop an algorithm where a task can be split into n equally compute intensive parallel subtasks. One can have a fine granularity, where many processes work on the task that needs to be done (perhaps making load balancing easier), or coarse granularity that allows a larger ratio of computation to communication. The optimal granularity depends on hardware architecture, the number of processors, the problem size and the specifics of the problem being solved. On the other hand, scalability is the ability to use increasing number of processors as efficiently as possible. For most algorithms, a saturation point is reached with increasing number of processors beyond which the speedup drops. In other words, there is a nonlinear relationship between speedup and the number of processors. Finding the optimal number of processors for a given task is a challenging job.
III. Design Optimization Methodology
HYI-3D is developed using object-oriented concepts in C++ and uses MPI for message passing. The software suite is a collection of independent modules that can be plugged together to create the required set of finite element analysis and design optimization capabilities and can be executed either sequentially or in parallel. Traditional optimization techniques require function evaluation or both function and gradient evaluations. Function evaluation usually implies the use of a numerical simulation scheme such as finite element analysis that is compute and resource intensive. Gradient evaluation implies the use of some or all components of a function evaluation depending upon whether numerical or analytical derivatives are computed. The overriding concern then it to (a) minimize the number of function evaluations and (b) make the function evaluations as efficient as possible, as the design proceeds from the initial to its final state. One can attempt to meet these objectives by developing a good FEA engine as well as a robust, efficient and accurate DO methodology. There has been considerable attention paid to parallelization of optimization algorithms. 5, 6, 8, 10, 14, 19 
A. Finite Element Analysis Engine
The FE engine used during parallel computation uses the well-known domain decomposition (DD) idea. 11 The DD approach is used in all types of analyses. For example, the following steps are used during static analysis. 12 Step 1: Parse the FE Model Database In this step the FE model created by the preprocessor is read. Each processor reads the FE model database.
Step 2: Domain Decomposition Graph partitioning algorithm is used to generate the specified number of partitions or subdomains. The internal data structure for each subdomain is created. The internal and the boundary degrees of freedom are identified. The interior nodes are renumbered to reduce the amount of nonzero coefficients including fill-ins. Sparse matrix storage scheme is used for all the associated matrices.
Step 3 Step
4: Solution of Boundary Degrees of Freedom
In this step, communication takes place between each subdomain to form the total system equations. The unknowns are the boundary degrees of freedom. The equations are the n solved for the boundary degrees of freedom. An iterative solver is used to solve the effective system equations BBB = KDF (1) where B K is the stiffness matrix, B D is the displacement vector and B F is the force vector for boundary degrees of freedom. Convergence is said to have occurred if the absolute and the relative error norms for the residual vector are less than the user-specified tolerance.
Step
5: Solving for Internal Degrees of Freedom
Once the boundary displacements are obtained from Step 4, the internal degrees of freedom are solved. This is done in each subdomain separately.
Step 6: Post-Processing In this step the following is carried out in every subdomain. The performance of this methodology is a function of (a) how well the original problem is split into subdomains, (b) the underlying data structure to support the computations, (c) the efficiency of the sparse matrix algorithms, (d) the efficiency of the iterative solver, and (e) the efficiency of the algorithm to minimize the communications between the processes as the number of subdomains increases.
B. Topology Optimization
The approach followed for topology optimization is the SIMP (Simple Isotropic Model with Penalization) or power-law approach. 3, 4, 20 The topology optimization problem based on power law approach, where the objective is to minimize the compliance, can be stated as 
where f is the objective function, h is the mass fraction constraint function (M is actual mass used in optimization, 0 M is initial mass, and M F is desired mass fraction), KD=Fare the equilibrium equations, x is the design variable vector, and x L and x U are the lower and upper bounds respectively. The standard optimality criteria method 17 is used to solve the topology optimization problem. This method also incorporates a filtering technique 18 to handle checkerboarding (problems) and mesh dependency. Each iteration of the optimality criteria algorithm involves a complete FEA for calculation of compliance and strain energy for the updated design variables. Thus coarse-grain parallelization is achieved by performing parallel FEA for each iteration.
Parallel lagrange multiplier computation is performed using parallel line search (fine grain parallelism). The required lagrange multiplier value satisfies the mass fraction constraint and results in the updated design variable vector. The constraint function is assumed to be a monotonically decreasing function of the lagrange multiplier. Thus once the zero of the constraint function is bracketed, one -dimensional parallel line search using multi-section algorithm is performed. The multi-section algorithm is the (master-slave) parallel version of the bisection method for line search implemented as follows.
Step 1: Divide the interval bracketing the zero into p n equal parts ( p n is the number of processes) resulting in 1 p n − points (lagrange multiplier values) within the interval. Let these points be called the interior points.
Step 2: Each slave process is assigned an interior point and it obtains the constraint value for its value of lagrange multiplier.
Step 3 : These values are sent to the master processor and assembled in order.
Step 4: The new interval is the one whose end points are opposite in sign and the updated value of the lagrange multiplier is the average of the new interval endpoint values
Step 5 : This process is repeated until the interval is sufficiently small in size.
C. Sizing and Shape Optimization
Sizing optimization involves the modification of the cross section or thickness of finite elements to obtain an optimum objective function (minimum weight, maximum stiffness, etc.) while satisfying certain constraints (stress, displacement, etc.). In shape optimization the nodal coordinates of the model boundary are modified to obtain a shape that optimizes the objective function while satisfying the specified constraints. The hybrid natural approach 1, 15 is used to compute the velocity field.
The gradient-based optimizers used for sizing and shape optimization are Method of Feasible Directions (MFD) and Sequential Unconstrained Minimization Techniques (SUMT). MFD is applicable to non-convex problems for minimizing an objective function subject to inequality and bound constraints as shown below. 
where f is the objective function, m is the number of inequality constraints, ,(1,...,) j gjm = are the constraint functions, x is the design variable vector, and x L and x U are the lower and upper bounds respectively. The problem statement for SUMT is the same as MFD but without the inequality constraints.
MFD and SUMT involve obtaining the direction vector using objective function and constraint gradient information and then performing line search along the new search direction to find the step size that minimizes the objective function while satisfying the constraints. This process is repeated until convergence (no significant change in objective function, maximum number of iterations exceeded, etc.).
Gradients are evaluated using the forward difference method. The number of FEAs required during gradient evaluation is equal to the number of design variables. Thus gradient evaluation is parallelized such that the number of FEAs is divided equally among the available processes.
Parallel line search is implemented using a combination of the multi-section scheme 2 and parallel Avriel search 13 as described below.
Step1: Bracketing the minimum (a) Initiate a march with 1 p n − steps ( p n is the number of processes).
(b) Each process performs a function evaluation at its respective step and sends its objective function and maximum constraint values to the master process.
(c) The bracket is the interval whose end points have maximum constraint values of opposite sign.
Step 2: Zero-finding (a) Divide the bracket found in Step 1 into p n equal parts resulting in 1 p n − step sizes within the interval. Let these step sizes be termed interior points.
(b) Each slave processor is assigned an interior point and it obtains the maximum constraint value for its step size.
(c) These values are sent to the master processor and assembled in order.
(d) The new interval is the one whose end points have maximum constraint values of opposite sign and the updated step size U α is the feasible endpoint.
(e) This process is repeated until the interval is sufficiently small in size. The interval of uncertainty is [0, U α ].
Step 3: Function minimization using parallel Avriel search (a) Obtain 2 p kn = points within the interval of uncertainty using the following steps.
The initial interval is given by 
IV. Numerical Examples
Example problems using the developed software 9 are presented in this section. We first present information on the computing environments before discussing the numerical examples.
A. Cluster Information
Three different clusters are used in this study. Details of the clusters are presented below. Terminology: The term 1PN denotes execution when one process is launched per node in a cluster. Similarly, the term 2PN denotes execution when two processes are launched per node in a cluster. In terms of the processor speed, the ASU cluster has the fastest processor while the HYI and the Maui processors are comparable in speed † † . In terms of memory, the HYI cluster has 1 GB/processor, the Maui cluster has 512 MB/processor followed by the ASU cluster at 1 GB/processor with 1PN and 512 MB/processor with 2PN. Finally, the communication bandwidth for the Maui cluster is much better than the ASU cluster and both are significantly better than the HYI cluster. 1PN scenario is not used with MHPCC since it is not possible to guarantee that at any given time, only one process is launched per node.
Hawthorne & York (HYI) cluster:
The following definitions are used. Speedup for n processes, n S is defined as
where 1 t is the wall clock time for 1 process and n t is the wall clock time for n processes. (Percent) Efficiency of n processes, n E is defined as † † The SPECfp2000 results show that P4-1.7 GHZ Xeon to be about twice as fast as either P3-933 MHz or P3-1GHz.
Unless otherwise noted, a convergence tolerance (for the iterative solver) of 9 
10
− was used in all the problems. This tolerance value is (perhaps) too small and a larger value would have yielded acceptable displacement and stress values from a design perspective (see Example 4). Figure 1 shows the deformed shape of a steel support bracket. The bottom two holes are fixed and a distributed loading is applied to the top hole. Second-order 10-noded tetrahedral elements are used in the model. The model is made up of 41,953 elements and 64,975 nodes for a total of 194,925 degrees-of-freedom. The stiffness matrix has 7,492,017 nonzero coefficients before factorization and 104,726,757 nonzero coefficients after factorization. Figure 2 shows the graph of wall clock time versus number of partitions for the ASU cluster. Since each partition is handled by one process, the number of partitions is also equal to the number of processes. The largest speedups are obtained early on and the gains level off after an optimal number of processes. The performance of the 1PN scenario is generally better than 2PN. The hardware characteristics determine the throughput of the system to a very large extent. When the problem size is large (data does not fit in RAM) so that a large number of page faults occur, the performance of 2PN is worse simply because both processes access data in a shared memory mode. As the memory requirement decreases, the gap between 1PN and 2PN also decreases. The blips in the graph are usually due to the fact that the partitions do not have equal number of interior and boundary nodes, a load-balancing issue. Table 1 shows the timing values for all the three clusters. The single process timing value is obtained by running the single processor FE engine. The best speedup for the HYI and the ASU clusters are 2.14 (4 processes) and 5.34 (14 processes), and the best efficiency values are 80% (2 processes) and 86% (2 processes), respectively. For the MHPCC cluster, the best speedup is 9.9 (64 processes) and the best efficiency value is 37% (4 and 8 processes). For this example, the communication time between the processes becomes a more dominant factor as the number of processes increases. The speedup for the ASU cluster going from 1 to 4 processes is about 10% more than that of the HYI cluster. Similarly, speedup for the MHPCC cluster is about 62% more compared to the ASU cluster going from 4 processors (2PN) to 16 processors. Figure 3 shows the main body of a steel castor. The displacements at the cylindrical holes in each arm are fixed and an upward (compressive) uniform distributed loading is applied to the top surface of the two larger cutouts at the bottom. Second-order 10-noded tetrahedral elements are used in the model. The model is made up of 287,807 elements and 414,736 nodes for a total of 1,244,208 degrees-of-freedom. Figure 4 shows the FE model of the steel pressure vessel. The model is constructed using 3-noded triangular isotropic thin plate/shell elements. The model has 31,852 nodes yielding 191,112 degrees of freedom. The pressure vessel is provided with ring stiffeners distributed symmetrically about the centre of the tank. The tank is subjected to internal pressure of 0.1 MPa. There are 63,700 finite elements. Thus there are 127,400 stress constraints (von Mises failure criteria with the allowable stress as 25 MPa). The sizing optimal design problem is carried for minimizing the volume of the material used with 40 design variables (shell thicknesses).
B. Example 1: TET10-T64975 Problem

C. Example 2: CASTER5 Problem
D. Example 3: Pressure Vessel Si zing Design Optimization Problem
The problem converges in 11 iterations from an initial volume of 1.87 m 3 to an optimal volume of 0.114 m 3 . The results are presented in Table 3 for the ASU cluster. The von Mises stresses for the final design are shown in Fig. 5 .
The time for one process is estimated by multiplying the time for one FE analysis (200 s) by the total number of function evaluations (560) necessary for 10 iterations. The time required for other steps in the algorithm are estimated to be very small in comparison. Using coarse grain parallelism, a speedup of 2.86 is obtained. When fine grain parallelism is used, the speedup increases to 4.94. Figure 6 shows the FE model of the support bracket. The model is constructed using 114,859 10-noded tetrahedral elements. The model has 176,326 nodes yielding 528,978 degrees of freedom. Nodes on the bottom two holes are restrained in all directions. A line load of 1000 N/cm is applied in the global X direction on the nodes in the top hole. The topology optimization procedure is executed for 10 iterations using a mass fraction value of 30%. The initial compliance is 113023 N-cm and the final compliance is 29810 N -cm. The design optimization results are presented in Table 4 for the ASU cluster. In this example, the effect of the convergence tolerance used in the iterative solver is studied. The results show that there is little loss of accuracy when a larger convergence tolerance is used. However, the gains in computational efficiency vary between 24-32%.
E. Example 4: Support Bracket Topology Optimization Problem
V. Concluding Remarks
Results from a software system that uses a distributed parallel processing methodology are presented. Coarse and find grain parallelism in the FE engine and the DO engine are implemented. The results are extremely promising and show that the methodology can be used in both high-cost and low cost computing clusters to achieve performance gains. 
