Introduction
It is well known that the solution of sparse linear systems, generally expressed in the form Ax = b, is a core task of numerical simulation. In semiconductor device simulation the coefficient matrix A is unsymmetric but structurally symmetric ([2]). Such linear systems can be solved by iterative or direct methods. While iterative methods do not always converge because of the conditioning of the matrix, direct methods usually consume more time and memory.
Computational considerations
The problem of reducing the memory needed by a direct solver is closely related to the problem of minimizing the size of the LU decomposition by reordering the coefficient matrix. In this respect the Minimum Degree reordering algorithm has proved to be very successful for general sparse systems. This algorithm has been enhanced in terms of execution speed to what is referred to as the Multiple Minimum Degree algorithm ([5]).
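The effect of reordering on the size of the factors can be illustrated with a small symbolic elimination. The sketch below is only an illustration under stated assumptions: it implements the plain greedy minimum degree heuristic on an adjacency-set graph, not the Multiple Minimum Degree algorithm of [5], and the function names `eliminate` and `minimum_degree_order` are hypothetical.

```python
def eliminate(adj, order):
    """Symbolically eliminate vertices in `order`; return the number of fill edges."""
    g = {v: set(nb) for v, nb in adj.items()}  # work on a copy
    fill = 0
    for v in order:
        nbrs = g.pop(v)
        for u in nbrs:
            g[u].discard(v)
        # Eliminating v makes its remaining neighbours pairwise adjacent.
        for u in nbrs:
            for w in nbrs:
                if u < w and w not in g[u]:
                    g[u].add(w)
                    g[w].add(u)
                    fill += 1
    return fill


def minimum_degree_order(adj):
    """Greedy minimum degree ordering (ties broken by smallest vertex label)."""
    g = {v: set(nb) for v, nb in adj.items()}
    order = []
    while g:
        v = min(g, key=lambda x: (len(g[x]), x))
        nbrs = g.pop(v)
        for u in nbrs:
            g[u].discard(v)
        for u in nbrs:
            g[u] |= nbrs - {u}  # neighbours of v form a clique
        order.append(v)
    return order


# "Arrow" pattern: vertex 0 is coupled to every other vertex.
n = 6
arrow = {0: set(range(1, n))}
for i in range(1, n):
    arrow[i] = {0}

natural = list(range(n))
md = minimum_degree_order(arrow)
print(eliminate(arrow, natural), eliminate(arrow, md))  # 10 0
```

Eliminating the densely coupled vertex first fills the whole remaining matrix, whereas the minimum degree order postpones it and produces no fill at all for this pattern.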
On the other hand, speeding up a direct solver basically means computing the LU factorization faster. Extensive research in this area has led to so-called supernodal techniques. The key concept of these techniques is what is nowadays referred to as a supernode [1]. During symbolic factorization, supernodes are identified as sets of consecutive columns in the factor L of the LU decomposition with the following structural properties. A supernode formed by, say, s adjacent columns consists of two blocks: a dense diagonal block of size s x s and a block of width s below the diagonal block in which all columns share the same sparsity pattern. A sample supernode is depicted in Figure 1, denoted by the letter S. Computing column J involves the following steps: for all supernodes S updating column J, determine the vector V1 = M*(D.V) and then subtract it from the contents of J, i.e. J = J - V1 (see Fig. 1). The determination of vector V1 involves dense matrix-vector multiplications, which run at vector speed on today's supercomputers. The subtraction step requires gather/scatter operations, which are supported in hardware on many supercomputers. As a result, supernodal techniques make excellent use of the hardware features and are thus highly powerful.
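The update J = J - V1 with V1 = M*(D.V) can be sketched in a few lines. Figure 1 is not reproduced here, so the roles given to the symbols are assumptions: M is taken to be the dense block of supernode S restricted to the rows that affect column J, D a diagonal scaling stored as a vector, and V the s entries coupling S to J. The function name `supernode_update` and the dictionary storage of column J are likewise hypothetical.

```python
def supernode_update(col_j, rows_s, M, D, V):
    """Apply J = J - M*(D.V) for one updating supernode S.

    col_j  : dict row index -> value, the sparse target column J
    rows_s : row indices shared by all columns of S (one per row of M)
    M      : dense block of S, len(rows_s) rows by s columns
    D      : s diagonal scaling entries (assumed meaning of D)
    V      : s entries coupling S to column J (assumed meaning of V)
    """
    # Dense part: scale V by D, then one dense matrix-vector product.
    dv = [d * v for d, v in zip(D, V)]
    v1 = [sum(m * x for m, x in zip(row, dv)) for row in M]
    # Sparse part: scatter-subtract V1 into J through the row index list.
    for r, val in zip(rows_s, v1):
        col_j[r] = col_j.get(r, 0.0) - val
    return col_j


col_j = {4: 10.0, 6: 1.0, 7: 0.0, 9: 5.0}
supernode_update(col_j, [4, 6, 7],
                 [[1.0, 2.0], [0.0, 1.0], [3.0, 0.0]],
                 [2.0, 0.5], [1.0, 4.0])
print(col_j)  # {4: 4.0, 6: -1.0, 7: -6.0, 9: 5.0}
```

The dense product is the part that vectorizes well; the final loop is the scatter step that relies on indirect addressing.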
The computational power of supernodal techniques applied to symmetric positive definite linear systems has been documented in papers like [1]. Since in semiconductor device simulation the linear systems are usually unsymmetric but structurally symmetric [...] that were generated in the reordering and symbolic factorization steps. The next for-loop (2) goes one level deeper and scans over all nodes j of the current supernode J, starting with the smallest index. The innermost loop (3) handles the contribution of all updating supernodes K to the current node j. Finally, column/row j has to be scaled by its diagonal element (4).
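The loop nest described above can be sketched as follows. This is a minimal illustration and not the original code: it uses dense storage, assumes that the outermost loop (1) runs over the supernodes J, assumes an LDU form of the factorization without pivoting, and treats every earlier supernode as an updating supernode. The function name `factor_ldu` is hypothetical.

```python
def factor_ldu(a, supernodes):
    """In-place A = L*D*U without pivoting, organised supernode by supernode.

    a          : square matrix as a list of row lists (overwritten)
    supernodes : list of (first, last) column ranges, inclusive, in order
    On return, a holds L strictly below, D on, and U strictly above the diagonal.
    """
    n = len(a)
    for first, last in supernodes:                 # (1) current supernode J
        for j in range(first, last + 1):           # (2) nodes j of J, ascending
            for k_first, k_last in supernodes:     # (3) updating supernodes K
                if k_first >= j:
                    break
                # Columns of K that are already factored (k < j).
                cols = range(k_first, min(k_last, j - 1) + 1)
                for i in range(j, n):              # update column j
                    a[i][j] -= sum(a[i][k] * a[k][k] * a[k][j] for k in cols)
                for i in range(j + 1, n):          # update row j
                    a[j][i] -= sum(a[j][k] * a[k][k] * a[k][i] for k in cols)
            d = a[j][j]                            # (4) scale by the diagonal
            for i in range(j + 1, n):
                a[i][j] /= d
                a[j][i] /= d
    return a


a = [[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]]
print(factor_ldu(a, [(0, 1), (2, 2)]))
# [[2.0, 0.5, 0.5], [2.0, 1.0, 1.0], [4.0, 3.0, 2.0]]
```

A sparse implementation would visit only those supernodes K that actually have non-zero entries in row j, and would use the index structures from symbolic factorization instead of dense rows.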
Block supernode algorithms
Block supernode factorization operates on groups of columns/rows or even a whole supernode at the same time instead of focusing on a single column/row. Doing so does not reduce the number of memory references, but by grouping them together memory fetches and stores can be made more efficient, i.e. the same index map is used only once throughout a loop cycle ([6]). On the other hand, supernode-supernode factorization increases storage overhead significantly, since the intermediate results for more than one column/row have to be kept and additional data structures are needed to support this technique. In our tests, memory consumption was at worst 6 to 20 times that of our supernode-node implementations. Furthermore, the time necessary for the setup and administration of these data structures cannot be neglected.
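The idea of sharing one index map across a group of columns can be sketched as follows. The code is a simplified illustration, not the implementation discussed here: it assumes that all target columns are stored densely over the same row list, and the names `relative_indices` and `block_update` are hypothetical.

```python
def relative_indices(rows_src, rows_dst):
    """Map each source row to its position in the target row list."""
    pos = {r: p for p, r in enumerate(rows_dst)}
    return [pos[r] for r in rows_src]


def block_update(target, rows_dst, rows_src, updates):
    """Subtract one update vector from each target column.

    target  : list of columns, each a list aligned with rows_dst
    updates : list of vectors aligned with rows_src, one per target column
    """
    rel = relative_indices(rows_src, rows_dst)  # index map built once
    for col, upd in zip(target, updates):       # and reused for every column
        for p, val in zip(rel, upd):
            col[p] -= val
    return target


target = [[1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]]
block_update(target, [3, 5, 6, 8], [5, 8], [[0.5, 0.25], [1.0, 2.0]])
print(target)  # [[1.0, 0.5, 1.0, 0.75], [2.0, 1.0, 2.0, 0.0]]
```

The sketch also hints at the storage cost: all update vectors of the block are held at the same time, together with the index map.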
Benchmark
We present the timing results for a medium-sized linear system stemming from a simulation of a MOSFET. The linear system has 12,000 unknowns and about 250,000 non-zero entries in the coefficient matrix. The benchmark was run on a Convex C220, a Cray-2, a Cray Y-MP, a NEC SX-3, and a Cray C98. The numbers shown in Table 1 are those of the best performing factorization algorithm, in CPU seconds.
Table 1: Timing results for the best performing algorithm (seconds)
On none of the machines used in the benchmark did we find block supernode algorithms to perform better than the supernode-node algorithms. Mainly, there are two reasons for that:
For all of the block supernode algorithms implemented we noticed a significant increase in scalar memory references. This increase stems from the additional data structure handling built into the block supernode algorithms. Obviously, this hurts especially on machines with scalar data caches like the Convex and the NEC. Here, block supernode methods lose performance because of scalar data cache misses.
Block supernode techniques are most effective when the supernodes contain many columns/rows, i.e. when the supernode partitioning consists of a small number of supernodes. A small supernode partitioning provides for bigger blocks during supernode update. In our test cases supernodes contain 5 to 6 columns/rows on average. Additionally, our factor columns/rows are very sparse (about 200 non-zero entries maximum), so that there are only a few cases during the factorization where we can exploit the potential of the block supernode algorithms.
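The average supernode size can be measured directly from the column patterns produced by symbolic factorization. The following sketch groups consecutive columns whose patterns match the structural definition of a supernode; the input layout and the function name `find_supernodes` are assumptions made for this illustration, and the small pattern is invented test data, not one of the benchmark matrices.

```python
def find_supernodes(below):
    """Group consecutive columns into supernodes.

    below[j] is the set of row indices i > j with a structural non-zero
    in column j of the factor L.
    """
    supernodes = []
    start = 0
    for j in range(1, len(below) + 1):
        # Column j extends the current supernode when column j-1 has a
        # non-zero in row j and otherwise the same pattern as column j.
        if j < len(below) and below[j - 1] == below[j] | {j}:
            continue
        supernodes.append((start, j - 1))
        start = j
    return supernodes


below = [{1, 2, 5}, {2, 5}, {5}, {5}, {5}, set()]
parts = find_supernodes(below)
sizes = [last - first + 1 for first, last in parts]
print(parts, sum(sizes) / len(sizes))  # [(0, 2), (3, 3), (4, 5)] 2.0
```

A partition with many short ranges, as in this toy pattern, leaves little room for block operations.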
Conclusion
In this paper we presented supernodal techniques suitable for structurally symmetric linear systems as they appear in semiconductor device simulation. Among these, supernode-node updating schemes perform best for this type of application. We have shown that block supernode methods cannot be exploited to their full potential, which is due to the extreme sparsity of the linear systems and the small supernode sizes.
Acknowledgement
The authors highly appreciate the support of Cray Research (Switzerland) for providing the original source code. Also, we thank the staff of the computer centers of the Swiss Institutes of Technology in Zurich and Lausanne as well as the Swiss Scientific Computing Center in Manno for providing us access to their supercomputers.
Additionally, we are grateful to J.F. Bürgler and S. Muller (both from the Integrated System Laboratory) for providing the set of test cases. Our special thanks go to C. Pommerell (now AT&T Bell Labs) for a number of interesting discussions in the hallway of the laboratory, and to R.W. Peyton (ORNL) for his help in understanding the original code.
