Abstruct-Two-dimensional device simulation is well established, but some devices require the use of three-dimensional simulation, at a much higher computational cost. This paper explores the issue of whether massively parallel computers are suitable for this problem. A novel algorithm for the solution of large sparse asymmetric linear systems is presented which uses constant CPU time per matrix iteration. Results are demonstrated on a wide variety of devices and mesh sizes.
I. INTRODUCTION UMERICAL simulation of semiconductor devices is
N an important tool for today's device and process designers. In fact, it allows the reduction of the time and cost required to develop new devices replacing expensive experiments with cheaper simulations. State-of-the-art two-dimensional simulators [ 11- [3] are now mature enough to be currently in use in most semiconductor companies.
However, recent advances in processing technology are creating the need of including in the simulation physical phenomena which cannot be modeled in a two-dimensional domain. As an example, the availability of submicrometer technology for MOS devices requires the accurate modeling of narrow channel effects which in turn imply the use of three-dimensional discretization. Accurate modeling of gate capacitance in EPROM devices has been achieved when the third dimension has been added to the model [4] . Finally, parasitic bipolar devices causing the latch-up in CMOS devices can be accurately modeled by taking into account the three-dimensional geometry of the layout [5] .
However, including these new effects yields an enormous increase in CPU time if the present two-dimensional algorithms are extended to cover the three-dimensional case on a conventional processor. For example, the discretization of the domain of a bipolar device may require the solution of 100 000 equations or more. Experiments with HFIELDS-3D [6] aimed at the three-dimensional simulation of a MOS transistor described with a small grid of 2565 nodes needed about 1.5 h on a Micro Vax GPX I1 for each bias point.
Vector machines, such as the Cray, are the most used computers for large scale device simulations. These computers typically have one very powerful scalar CPU which controls the execution of parallel operations on a pipeline composed of a few arithmetic and logic units (ALU's). Vector computers are the oldest among parallel machines and are still the most used due to their high reliability and the capability of efficiently using serial programs due to the availability of the scalar CPU. The parallelism achievable by this architecture, though, is limited by several factors, like, for instance, the I/O bottleneck between CPU's and memory.
In [7] vector machines were used to speed up the solution of linear equations for three-dimensional device simulations achieving a factor of four speed up over the scalar version of the code for problems having about 4000 unknowns using a Cray-1 . In [8] a three-dimensional device simulator has been presented based on finite difference techniques for discretizing the domain and a modification of the conjugate gradient method to solve the nonsymmetric linear system arising from the previous discretization. The vectorization of the algorithm obtains a speedup of sixteen for typical problems arising from threedimensional simulation. In [9] a three-dimensional simulator was presented and the algorithms for the solution of the linear system stemming from the Newton iteration were compared on a set of fairly small problems having up to 4000 grid nodes. In [lo] algorithms for multiprocessor machines were analyzed and characterized for a 16 processor computer.
Improvements in algorithms developed for low parallelism computers are not expected to provide the computation speed needed for an effective design of new structures and processes. However, recent advances in 0278 has not yet been addressed, even though it plays a key role in this application.
In this paper we present a new algorithm for the solution of device equations in three-dimensional domains and estimate their performance on several realistic examples, comparing it with the best available techniques suitable for conventional processors.
The paper is organized as follows: in Section I1 the programming model for a massively parallel computer is presented, in Section I11 the mathematical model of semiconductor devices is briefly introduced, in Section IV matrix solution techniques for the coupled solution of the driftdiffusion equations are addressed. In Section V, the efficiency of the proposed algorithm is studied on different examples and the performance on a massively parallel computer is compared with that which is achievable using good algorithms on conventional computers. Finally, in Section VI some conclusions are drawn.
11. PROGRAMMING MODEL In order to study massively parallel algorithms, actual measured results are necessary. To do this, we have implemented them on the Connection Machine.
The Connection Machine [ l l ] is a massively parallel computer with up to 65 536 processors, with a conventional computer as a front end. Each processor has a l-b CPU and 256k-bits of local memory. A floating point accelerator is shared by each cluster of 32 processors. The machine is a single instruction-multiple data (SIMD) [ 151 architecture. Communication between processors is either by a regular pattern on a N-dimensional grid or by an arbitrary pattern on a hypercube.
The Connection Machine has a mechanism for emulating more processors than actually exist in the machine, referred to as virtual processors. By using this, it is possible to emulate a machine with a factor of N more processors, where each processor then has a factor of N less memory, and runs a factor of N slower. N is called the virtual processor ratio (VP ratio), and is restricted to powers of 2. In practice, for small VP ratios, the slow down is less than a factor of N , because the overhead of the tasks on the front end are overlapped with the computation on the Connection Machine. For large VP ratios, the slow down becomes very close to a factor of N .
TMThe Connection Machine is a trademark of Thinking Machines Corp.
DEVICE SIMULATION
The steady-state drift-diffusion model of semiconductor devices, based on the Poisson equation and the continuity equations for electrons and holes [ 161, has been used in this work:
(1) (2) v (E?+) = -q ( p -n + N~ -N,)
where pn, p p , D,, Dp, are, respectively, the mobilities and diffusion coefficients for electrons and holes. The recombination rate R includes the Auger and Shockley-ReadHall terms. In the dielectric material, only (1) applies with a suitable definition of the charge trapped in the insulator.
Equations (1)- (3) have been discretized on a threedimensional finite difference grid and the well-known Scharfetter-Gummel method [ 171 has been applied to (2)-(3), obtaining a large set of algebraic equations. These nonlinear equations have been iteratively solved using the Newton method, which requires the solution of a large, asymmetric linear system at each iteration.
To map the problem onto a massively parallel computer, each node of the grid is assigned to a processor.
Thus each processor stores the local values of +, n, and p , and the three rows of the matrix of the corresponding grid node. Using this allocation, the grid fits naturally on the organization of the machine and, at the same time, variables having a strong coupling due to the spatial adjacency are tightly clustered. The resulting matrix is a banded matrix, whose seven diagonals are 3 by 3 matrices representing the interaction among the local variables.
The computation can be broken up into the following tasks:
problem readin and setup; evaluate the equations for the Jacobian and righthand side of the Newton iteration; solve the associated linear system; post-processing of the results.
The time required to read in and set up the simulation depends on how the input is specified. If the input is specified functionally, the time is proportional to the number of regions. If the input is taken from a process simulator or other explicit source, the time is proportional to the number of grid points in the profile.
The time to output results depends on which outputs are desired. The time to compute the current or charge on a given terminal is constant per terminal. To output a profile of the simulation results requires time proportional to the number of grid points in the specified profile.
This portion of running time is required for any device simulator, and is not affected by the fact that this is a parallel implementation. In addition, special hardware is available on the Connection Machine that implements
To evaluate the equations and load the matrix is straight forward on a massively parallel processor. Most of the computation is completely local to each processor, and so can execute fully in parallel. The only exception is that this computation requires the values of the state variables at the neighboring nodes, which is a very efficient operation on the Connection Machine and the time spent doing it is negligible compared to the evaluation of the nonlinear equations.
The most difficult of these tasks is that of solving the matrix, which is addressed in the rest of this paper.
IV. MATRIX SOLUTION METHODS
Several approaches are available for the solution of sparse linear systems on parallel computers, which can be divided into the general categories of direct and iterative methods.
A straight forward implementation of a direct solution method on a massively parallel computer would require O(N2) time where N is the number of grid nodes. A clever ordering scheme such as nested dissection [ 191 can be used to reduce this time for sparse positive definite matrices. However, fill-ins greatly increase the difficulty of solving the matrix both in terms of the memory used per processor, in the number of processors required, and in the communication patterns between processors.
For this reason, iterative methods [20] , [21] are preferred for solving large sparse linear problems. Their advantages include ease of parallel and vector implementations, and the ability to avoid fill-ins [22] .
Among these techniques, we have chosen the conjugate gradient squared (CGS) method [23] , which is commonly used for device simulation [5] . In order to obtain good convergence behavior, the matrix must be preconditioned. Our massively parallel algorithm uses incomplete LU decomposition for preconditioning. In this method, the operations carried out are the same as in the regular LU decomposition but every fill-in generated is discarded from the LU factors, thus yielding a much faster factorization that is an approximation of the full LU factorization. The ordering of the equations for this factorization has a significant effect on the convergence behavior of the CGS algorithm as well as on the number of operations that can be carried out in parallel. Unfortunately, more parallelism yields in general slower convergence, so that finding an ordering that minimizes the overall running time is not trivial.
Natural Ordering
A commonly used ordering for preconditioning on sequential machines is the "natural" ordering. In this ordering, grid nodes (hence, matrix equations) are numbered first in the x direction, then in the y direction, and finally, in the z direction (or some other permutation of x, y , and z ) . where N,, N y , and N, are the number of grid points in the x, y , and z directions, respectively.
During the LU factorization and the forward and backward substitution, the condition that determines whether or not a node can be eliminated is that all adjacent nodes that have a lower index must have been eliminated already. At first glance, this is a completely sequential algorithm. On closer inspection, however, a significant number of operations can be carried out in parallel as shown in Fig. 1 where a two-dimensional example is shown. The sets of equations that can be processed in parallel form diagonal lines passing through the grid. In this example, these sets are: {l}, (2, 5 } , (3, 6, 9}, (4, 7, 10, 13}, (8, 11, 14}, (12, 15}, and (16). In three dimensions, these sets form diagonal planes. The number of steps required by the algorithm in three dimensions on a massively parallel computer is N, + Ny + N, -2, which is O(N'13) if the number of grid nodes is increased uniformly in each of the three dimensions.
A preconditioner using the natural ordering has good convergence behavior, but it does not exploit well the architecture of a massively parallel processor. In fact, the time per iteration, O(N113), grows fast with the number of grid points.
Red-Black Ordering
A red-black ordering [22] provides for a much more efficient parallel implementation of the matrix operations, but as we shall see later, the convergence of the CGS method is significantly slower. This ordering labels each grid node either red or black such that each node is not adjacent to any node of the same color. Thus given the assumption of no fill-ins, each red node is independent of every other red node, and each black node is independent of every other black node. With this ordering, during the LU factorization, forward and backward substitutions, all of the red nodes can be eliminated simultaneously first, followed by all of the black nodes, also simultaneously.
This ordering is very efficient for massively parallel computers since it needs constant time per matrix itera- tion, and is the method of choice for solving Poisson's equation [ 121 where the speed of convergence of the conjugate gradient method is not strongly affected when compared to the natural ordering. However, for the solution of the coupled drift diffusion equations, the CGS method with the red-black ordering converges much slower than the natural ordering as shown in Fig. 2 for a MOSFET simulation on a 16 by 16 by 2 grid. The error is defined to be the maximum residual generated by the CGS algorithm. The total time for solving a matrix using the redblack ordering is larger than that for the natural ordering even though the latter requires much more time per iteration.
Partitioned Natural Ordering
Given that the natural ordering performs well for CGS, we examined a modification to the basic scheme to yield a new method that would allow more parallel operations, the partitioned natural ordering.
In the natural ordering, the sets of nodes that can be eliminated in parallel form diagonal planes through the grid as shown in Fig. 1 , and a plane has to wait for the factorization process to terminate on the preceding plane. Now if we ignore the dependency between the nodes on plane m + 1 and the nodes on plane m, then we partition the original matrix into two parts: the one from plane 1 (the left-top node) to plane m and the other from plane m to plane n, the last plane (the right-bottom node) of the grid. In this case, at the first step of the iteration, all the nodes on planes 1 and m + 1 can be processed in parallel. After this step, all the nodes on planes 2 and m + 2 can be processed in parallel, and so on. Using the example in Fig. 1 , the groups of nodes that can be eliminated in parallel are now: (1, 8, 11, 14}, (2, 5 , 12, 15}, (3, 6, 9, 15}, and (4, 7, 10, 13).
This method was first applied to the incomplete LU factorization, but gave disappointing results since many iterations were needed to achieve convergence. However, note that the CGS method requires only one LU factorization per matrix solution, but several forward and backward substitutions (one of each per iteration). Thus it is more important to speed up the forward and backward substitution processes than the factorization process. Hence, the partitioned natural ordering method was applied to forward and backward substitution as follows.
The incomplete LU factorization is camed out using the natural ordering, and then entries in the L and U factors, which link the elements of the matrix corresponding to the nodes across the partition, are discarded. In this way, the numerical values of the entries of the LU factors are the same as in the natural ordering, but the results of the forward and backward substitution processes are different.
If the point at which the planes are partitioned is the same for both the forward and backward substitutions, there would be no data transferred between the partitions because all of the matrix elements connecting the partitions would be set to zero. Therefore, the point at which the planes are partitioned is offset for the backward substitution compared to the forward substitution. For example, referring to Fig. 1 , if the partition for the forward substitution is between planes 4 and 5 , then for the backward substitution it would be between planes 3 and 4.
Of course, this technique is not limited to two partitions. Fig. 3 shows how the convergence is affected by increasing the number of partitions. As can be seen, the speed of convergence is slower as the number of partitions is increased, but not by orders of magnitude. Fig. 4 shows the CPU time for the complete solution of a matrix with respect to the number of partitions using this algorithm. The matrix being solved was generated from a MOSFET simulation on a 16 X 32 X 8 grid, so the number of sets of equivalent nodes in the dependency graph (or planes through the grid) is 54. This graph shows that the CPU time is minimized when the number of partitions is in approximately the range of 15 through 27, or, in other words, when there are approximately two or three planes per partition. As the number of partitions increases beyond half the total number of planes, the number of iterations for convergence increases significantly. All of the results presented in this paper related to the partitioned natural ordering use two planes per partition since this partition offers the maximum amount of parallelism without affecting the convergence speed by an intolerable amount.
Since the partitioned natural ordering uses a fixed number of planes per partition (two), the CPU time per matrix iteration is now a constant, as it was with the red-black ordering, but at the same time retains most of the convergence properties of the full natural ordering. Even with this approach to speed up forward and backward substitutions, Table VI shows that the time for the LU factorization (even though it is computed with the full natural order) is always significantly smaller, and growing at a smaller rate, than the time for the forward and backward substitutions.
V . NUMERICAL RESULTS
The simulator has been run on a variety of devices in order to study the performance of the proposed approach and to show its correctness. The machine used to generate these data is a 64k processor CM2 with a Sun4 as the front end. These devices are a diode, a MOSFET, and a bipolar transistor. All simulations here were initialized with piecewise constant potential and carrier concentration which associate with each region the potential of its contact and the concentration of the majority carrier. This initialization is used as the starting point for the coupled Newton iteration.
For the examples shown, the number of grid nodes in each dimension is always a power of two. This is not a restriction of the program, but if the number of grid nodes is less than a power of two in some direction, then the remaining processors up to that power of two will be idle.
The geometry of the diode is a 3-pm cube of p-type semiconductor with NA = 10l6 cme3. The n-type region is a square of area 1 pm2 and a junction depth of 1 pm, with a doping of 10l6 ~m -~. The results of the simulation are shown in Table I for a simulation at 1-V forward bias.
The MOSFET is an n-chantlel device with a W / L of 1 p m / l pm with an oxide thickness of 28 nm. The source and drain regions have a doping of 2 X lo2' cm-3 and a junction depth of 0.3 pm. The substrate doping is 2.5 X 1OI6 ~m -~. The results of the simulation are shown in Table I1 for a simulation with V, = l V and V,, = 0.5 V. Table U1 shows the results of a MOSFET simulation with the gate voltage swept from 0 to 3 V in 10 steps. The Newton iteration is initialized at each step by extrapolating the solution from the two previous steps, except for the first two steps which are initialized in the usual manner. This table shows that the number of matrix iterations required when the device is tumed on is similar to the number of matrix iterations when the device is tumed off as in Table 11 .
The bipolar is a standard n-p-n device, with a base width of 0.1 pm. The emitter doping is 2 X Id' ~m -~, the collector is lo2' ~m -~, the active base region is 1.4 X 10l8 ~m -~, the base contact region is 3 X 1019 ~m -~, the buried N+ region is 3 x 1019 ~m '~, and the substrate is Table V for a simulation with V,, = 1 V, and V,, is swept from 0.1 to 0.9 V. These tables allow us to show the effect of different VP ratios. The CPU time is, in fact, constant per iteration up to the limit of the number of physical processors in the Connection Machine, after which the time approaches a linear increase.
A graph of CPU time per iteration for these simulations is shown in Fig. 5 , compared against the time per iteration of a standard natural ordering on a sequential machine. The sequential machine used in this comparison is a DEC 5400 equipped with 64 Mbytes of memory. The tests that were carried out all fit within the main memory of the machine, so there were no effects due to the use of virtual memory.
The total CPU time required for a complete solution of a matrix as a function of the number of grid nodes is shown in Fig. 6 for the diode, and in Fig. 7 for the MOSFET. These data are shown for the Connection Machine and for a sequential computer. The data for the sequential com- puter were produced by dumping a matrix from the device simulator on the Connection Machine, and then solving this matrix on a sequential computer with a Fortran program that uses a CGS iteration with the natural ordering. This is an algorithm that is commonly used on sequential and vector computers, so Figs. 6 and 7 are comparing the algorithm on the Connection Machine versus a good (reportedly the best) algorithm on a sequential machine. We have the actual results for the DEC 5400, but we also estimated the performance of our massively parallel implementation against super computers using the 6601 mesh points results in [5] , where the performance of an existing machine in the supercomputer class was reported to be 50 times faster than machines in the class of the DEC 5400 when solving linear systems with preconditioned CGS. As can be seen, the Connection Machine algorithm ex-2 + 3 l t ceeds this estimate of supercomputer performance for problems of greater than 15 000 grid nodes.
A graph of the total number of matrix iterations as a function of grid size is shown in Fig. 8 , both for the standard natural ordering and for the partitioned natural ordering. The number of iterations required for the different devices varies because the structure of the device affects the numerical properties of the simulation. This shows that the rate of increase of the number of iterations required for the partitioned natural ordering is not significantly greater than that for the standard natural ordering. The plot of the MOSFET simulation using natural ordering does not extend as far as that for the partitioned method because the CPU time required for the natural ordering was significantly larger. than that for the partitioned method. A breakdown of the CPU time spent in each of the sections of the simulator for a complete simulation is shown in Table VI and Fig. 9 , as a function of the number of grid nodes. The device being simulated here is the MOS-FET at a single bias point. Loading the equations requires constant time per Newton iteration, which remains roughly constant per simulation. The time to factor the matrix is 0(N'I3) and is required once per Newton iteration. This is the only component in the body of the program that is not constant time per matrix iteration, but the total time is still dominated by the time to solve the matrix.
VI. CONCLUSIONS
Based on a CGS iteration with a partitioned natural ordering, a new massively parallel algorithm for threedimensional device simulation has been presented. This algorithm requires constant time per matrix iteration independent of the number of grid nodes, and has convergence properties comparable to those of good algorithms currently used on sequential computers. This algorithm has been implemented on the Connection Machine and achieves supercomputer performance for large numbers of grid nodes. The results presented in this paper show that massively parallel computers are effective for the simulation of semiconductor devices.
