Abstract. The Chip Multiprocessor (CMP) will be the basic building block for computer systems ranging from laptops to supercomputers. New software developments at all levels are needed to fully utilize these systems. In this work, we evaluate performance of different highperformance sparse LU factorization and triangular solution algorithms on several representative multicore machines. We included both Pthreads and MPI implementations in this study and found that the Pthreads implementation consistently delivers good performance and that a left-looking algorithm is usually superior.
Introduction
The Chip Multiprocessor (CMP) systems will be the basic building blocks for computers ranging from laptops to supercomputers. Compared to the superscalar microprocessors exploiting high degree of instruction level parallelism, the CMP designs represent a paradigm shift that strikes better trade-offs between performance (parallelism) and energy efficiency. In theory, the CMPs can often be programmed the same way as the conventional SMPs, but the CMPs have lower memory bandwidth and abundance of fine-grained parallelism. Given the diversity of CMP designs, it is necessary, albeit difficult, to develop new software strategies at the system level as well as the application level in order to fully utilize the hardware resources.
In this paper, we study the factorization and triangular solution kernels in the sparse direct solver SuperLU [1] on two leading CMP systems. Our goal of this study is twofold. First, we would like to evaluate performance of the existing implementations on the new CMP architectures, and secondly, we would like to identify the inefficiencies in the algorithms and/or implementations and the ways to improve them for the new architectures.
Experimental machines
Our testing systems include an Intel Colvertown, a Sun VictoriaFalls, and an IBM Power5. The last one contains a conventional SMP node. Table 1 summarizes the key architectural features of the three systems used in this study. The sources come from [2, 3, 4] .
The Intel Colvertown consists of two sockets, each with two pairs of dual-core Xeon chips (Core2Duo), with total eight processors (Dell PowerEdge 1950 dual-socket). Each core runs at 2.33 GHz with a peak performance of 9.3 Gflops (4 flops per cycle) and has a private 32 KB L1 cache. Each chip (two cores) share a 4 MB L2 cache. Each socket has access to a Front Side Bus (FSB) delivering 10.6 GB/s. The two independent FSBs are connected to the memory controller which interfaces to the DRAM channels, delivering 21.3 GB/s read memory bandwidth and 10.6 GB/s write bandwidth.
The dual-chip Sun VictoriaFalls contains 16 SPARCv9 cores, in which each CMP is a Niagara2 chip with 8 cores. Each core runs at 1.16 GHz with a peak performance of 1.16 Gflops, and has a private 8 KB L1 cache. All eight cores share a 4 MB L2 cache. In addition, each core The IBM p575 Power5 is a scalable distributed-memory high-performance computing system consisting of conventional SMP nodes. The entire system (bassi at NERSC) has 111 compute nodes, each of which has 8 Power5 processors running at 1.9 GHz and has a shared-memory pool of 32 GBytes. Each processor has a peak performance of 7.6 GFlops (4 flops per cycle) and has a private 32 KB L1 cache. We use only one SMP node in this study.
Overview of the algorithms and implementations
3.1. Factorization in SuperLU MT using Pthreads or OpenMP SuperLU MT [8] was first developed with Pthreads, targeted for the SMPs of modest size (e.g., 32 processors). Recently, we have added OpenMP support. The factorization uses a panel-based left-looking algorithm, with partial pivoting and possibly with diagonal preference to better preserve sparsity. The kernel is based on supernode-panel update, which invokes multiple calls to BLAS 2, effectively achieving BLAS 2.5 speed. Parallelization uses an asychronous and barrier-free dynamic algorithm to schedule both coarse-grained and fine-grained parallel tasks and achieve a high level of concurrency. A globally shared task queue is used to store the ready panels in the column elimination tree; and whenever a thread becomes free, it obtains a ready panel from the task queue. The coarse-grained task is to factorize the independent panels in the disjoint subtrees, while the fine-grained task is to update panels by previously computed supernodes. The scheduler facilitates the smooth transition between the two types of tasks, and maintains load balance dynamically. Figure 1(a) illustrates the left-looking factorization scheme and the dynamic scheduling method using the elimination tree.
Factorization in SuperLU DIST using MPI
In order to address the scalability issues, the parallel algorithm in SuperLU DIST [5] is significantly different from that in SuperLU MT. Using the supernode partition, we perform a two-dimensional (nonuniform) block-cyclic matrix-to-processor mapping. The factorization uses a block-based right-looking algorithm which comprises abundance of parallelism during the block Table 2 . Properties of the test matrices. Minimum degree algorithm was applied to the structure of |A| + |A| T . "fill-ratio" denotes the ratio of number of nonzeros in L + U over that in A; "Mean S-node" refers to an average number of columns in a supernode. outer-product updates to the Schur complements. We use the elimination DAGs to identify block dependencies, and a look-ahead scheme to overlap communication with computation on the critical path. Figure 1 (b) illustrates the 2D block-cyclic partition and distribution for a sparse matrix. 
Triangular solution in SuperLU DIST
The triangular solution phase in SuperLU MT is not yet parallel; therefore we evaluate only the parallel algorithm in SuperLU DIST. When solving Lx = b, where L is a lower triangular matrix, the ith solution component is computed as
Therefore, computation of x i needs some of the previous solution components x j , j < i, depending on the sparsity structure of the i-th row of L. This sequentiality often poses scaling hurdle for a parallel algorithm. Another hurdle to achieve good performance is the much lower arithmetic density as measured by flops per byte of DRAM access or communication, compared to factorization. In SuperLU DIST, the parallel triangular algorithm uses the same 2D block-cyclic distribution as used in the factorization phase. Figure 1(c) illustrates such a distribution and the solution procedure. The processes owning the diagonal blocks (called diagonal processes) are responsible for computing the corresponding blocks of the x components. Consider one block row of the L matrix, as circled in figure 1(b) , computation of the corresponding x entry needs to proceed following the steps 1 , 2 , 3 , and 4 . Communication is necessary when the data reside on different processes, as in steps 1 and 4 . Table 2 presents the characteristics of our benchmarking matrices, which are available from the University of Florida Sparse Matrix Collection [6] . We benchmarked the Pthreads version of SuperLU MT, and SuperLU DIST using MPICH [7] . Table 3 shows the parallel factorization times of the two solvers on the Clovertown. The time includes both symbolic and numerical factorization. First, when we used MPICH configuration with ch p4 device for communication through sockets, the code slowed down significantly beyond two or four cores. After we switched to ch shmem setup, we obtained respectable speedup. So for a large distributed system comprising manycore chips, it is imperative to be able to configure MPICH in a hybrid device mode -ch shmem within socket and ch p4 across sockets. Currently, this hybrid mode is not avaialbe. Second, we examine the single core performance. We would expect that SuperLU DIST outperforms SuperLU MT, because the former uses BLAS 3, whereas the latter uses only BLAS 2.5. This is true only with two matrices, g7jac200 and stomach, which have relatively denser L and U factors (the fill ratios are over 40; see table 2), and hence BLAS 3 plays a larger role. For sparser problems, the algorithms are memory-bound. We believe the worse performance of SuperLU DIST is mainly due to more memory traffic of the right-looking algorithm, especially more memory write operations. Third, we examine the speedups of the two codes. The last column of table 3 shows the speedup obtained when creating eight threads or MPI tasks. The best speedup is 4.3 and is less than what we observed on conventional SMP processors [8] . After performing code profiling, we found that the overhead of the scheduling algorithm using the shared task queue and the synchronization cost using mutexes (locks) are quite small. Further study is needed to understand where the time goes. Lastly, SuperLU MT usually achieves more speedup than SuperLU DIST. This can be seen in the row "speedup ratio ( MT/ DIST)" associated with each matrix. In some cases, SuperLU MT achieves a factor of two more speedup than SuperLU DIST. Table 4 shows the parallel factorization times on the Sun VictoriaFalls. The single-thread performance of SuperLU DIST is usually better than that of SuperLU MT. This is probably because the machine has a higher byte-to-flop ratio (see table 1) compared to Clovertown, hence it does not penalize an algorithm that is memory-bandwidth demanding, such as the rightlooking algorithm in SuperLU DIST. However, the coarse-grained task parallelism supported by MPI programming does not match the fine-grained multithreading architecture; MPICH uften crashes when more than 16 tasks are generated. The Pthreads program is much more robust, and SuperLU MT can effectively use 64 threads. Similar to the Clovertown, SuperLU MT usually achieves more speedup than SuperLU DIST. In some case, SuperLU MT achieves a factor of 2 more speedup than SuperLU DIST.
Experimental results
For the parallel triangular solution, we compare the eight-core Clovertown with the eightprocessor Power5 SMP node. The parallel runt imes are tabulated in labeled "Current" correspond to the current released code, and the columns labeled "Improved" refer to the new implementation as a result of this study. As seen in the table, the current code runs much more slower with more cores involved on the Clovertown. A similar trend was also observed on the VictoriaFalls. After profiling various parts of the code, we found that the slowdown is due to many calls of MPI Reduce, which can take over 75% of the time using eight cores. In figure 1(b) , each diagonal process needs to know which off-diagonal processes will have sum contributions to be sent to the diagonal process. To compute this count, every process holds a 0/1 flag indicating whether this process has nonzero blocks. Then all the processes in each row communicator perform an MPI Reduce (by SUM) over the flags, with root being the diagonal process. Overall, each block row corresponds to one such reduction operation. The improvement we have made is the following. Instead of performing many reductions with one integer, we allocate a flag array of integers, the size of which is the number of block rows owned by each process. Each entry is the flag associated with one block row. Then all the processes in the respective process row perform only one reduction operation on this flag array. This has greatly reduced the memory and communication latency cost. On eight-core Clovertown, the improvement is significant, ranging from 6-to 9-fold. Even on the conventional SMP node, such as eight-CPU Power5, we also obtained 63% to 84% improvement.
