Abstract. The Chip Multiprocessor (CMP) will be the basic building block for computer systems ranging from laptops to supercomputers. New software developments at all levels are needed to fully utilize these systems. In this work, we evaluate the performance of different high-performance sparse LU factorization and triangular solution algorithms on several representative multicore machines. We include both Pthreads and MPI implementations in this study, and find that the Pthreads implementation consistently delivers good performance and that a left-looking algorithm is usually superior.
Introduction
The Chip Multiprocessor (CMP) systems will be the basic building blocks for computers ranging from laptops to supercomputers. Compared to superscalar microprocessors exploiting a high degree of instruction-level parallelism, the CMP designs represent a paradigm shift that strikes better trade-offs between performance (parallelism) and energy efficiency. In the case of a multicore architecture, high performance is achieved by replicating the execution units on a single die while keeping the clock rate (hence power consumption) relatively low. In the case of a multithreaded architecture, high throughput is achieved by providing multiple sets of hardware thread contexts for each FPU and simultaneously executing multiple streams of instructions without relying on speculation.
Multithreading can effectively hide instruction and cache latencies. In theory, the CMPs can often be programmed the same way as conventional SMPs, but the CMPs have lower memory bandwidth and an abundance of fine-grained parallelism. Given the diversity of CMP designs, it is necessary, albeit difficult, to develop new software strategies at both the system level and the application level in order to fully utilize the hardware resources.
In this paper, we study several kernel algorithms associated with sparse direct solvers on a couple of leading CMP systems. Direct solvers based on matrix factorizations are among the most reliable methods, or preconditioners, for solving sparse linear systems and eigenvalue problems. They are often the computational bottlenecks in large-scale computer modeling codes. Over the past decade or so, we have been developing new algorithms to exploit advanced high-performance, large-scale parallel computers. Our algorithm research has led to the software package called SuperLU [4, 7], which is widely used in research and industry.
Previously, Williams et al. [14] performed an extensive study and optimization of sparse matrix-vector multiply (SpMV) on several leading CMP systems. In SpMV, the matrix A needs to be read only once, hence the ratio of flops to memory accesses is O(1). The operation is largely memory-bound, since there is hardly any reuse. They developed various blocking strategies exploiting different hardware components, including thread blocking, register blocking, cache and local store blocking, and TLB blocking. In contrast, the factorization algorithms need to access the matrix A and the L and U factors multiple times, which exhibits higher reuse. But the ratio of flops to memory accesses changes throughout the elimination process. In the early stages of elimination, the factors are sparser and the workload is memory-bound, whereas in the later stages the factors are denser; Level 3 BLAS is appropriate and hence the workload is compute-bound. This change in arithmetic density makes it harder to develop uniform strategies for performance optimization.
The goal of this study is twofold. First, we would like to evaluate the performance of the existing implementations on the new CMP architectures; second, we would like to identify inefficiencies in the algorithms and/or implementations, and ways to improve them for the new architectures.
Experimental Machines
Our test systems include an Intel Clovertown, a Sun VictoriaFalls, and an IBM Power5. The last is a conventional SMP node.
The Intel Clovertown consists of two sockets, each with a pair of dual-core Xeon chips (Core2Duo), for a total of eight cores (Dell PowerEdge 1950 dual-socket). Each core runs at 2.33 GHz with a peak performance of 9.3 Gflops (4 flops per cycle), and has a private 32 KB L1 cache. Each chip (two cores) shares a 4 MB L2 cache. Each socket has access to a Front Side Bus (FSB) delivering 10.6 GB/s. The two independent FSBs are connected to the memory controller, which interfaces to the DRAM channels, delivering 21.3 GB/s of read memory bandwidth and 10.6 GB/s of write bandwidth.
The dual-chip Sun VictoriaFalls contains 16 SPARCv9 cores, where each CMP is a Niagara2 chip with 8 cores. Each core runs at 1.16 GHz with a peak performance of 1.16 Gflops, and has a private 8 KB L1 cache. All eight cores share a 4 MB L2 cache. In addition, each core supports eight hardware threads, and the entire dual-chip system provides a total of 128 threads. The two sockets are interconnected via External Coherence Hubs (ECH). There are altogether 8 FBDIMM memory channels, delivering an aggregate DRAM bandwidth of 42.6 GB/s for reads and 21.3 GB/s for writes.
The IBM p575 Power5 is a scalable distributed-memory HPC system consisting of conventional SMP nodes. The entire system (bassi at NERSC) has 111 compute nodes, each of which has 8 Power5 processors running at 1.9 GHz and a shared memory pool of 32 GBytes. Each processor has a peak performance of 7.6 Gflops (4 flops per cycle) and a private 32 KB L1 cache. We only use one SMP node in this study. Table 1 summarizes the key architectural features of the three systems used in this study. Figure 1 shows the simplified block diagrams of the two CMP systems. The data are drawn from [11, 12, 14].
Several characteristics in Table 1 are worth noting. Compared to the IBM SMP node, the core-to-DRAM bandwidths are considerably lower on the multicore systems. In addition, the write bandwidth on both multicore systems is only half of the read bandwidth. The byte-per-flop ratio shows the balance of memory bandwidth versus floating-point speed, with a larger ratio indicating higher bandwidth relative to core speed; for example, on the Clovertown, 21.3 GB/s of read bandwidth shared by eight cores of 9.3 Gflops each gives roughly 0.3 bytes per flop. The CMP systems are clearly worse than conventional SMPs in this regard, implying that algorithms demanding larger memory bandwidth will be penalized in performance. The conventional SMPs usually have more complex designs and consume more power; see the last line in the table. The quoted number for the p575 was measured while running a full computational workload, and is much lower than the manufacturer's peak power rating [12].
Overview of the Algorithms and Implementations
Over the years, we have been developing the SuperLU suite of libraries, with different variants of sparse Gaussian elimination targeted at shared- or distributed-memory high-performance machines [7]. Below, we briefly summarize the algorithmic and implementation features of the two variants used in this study.
Factorization in SuperLU MT using Pthreads or OpenMP
The initial target platforms of SuperLU MT were SMPs of modest size (e.g., 32 processors). It was first developed with Pthreads, and recently we have added OpenMP support. Earlier tests on a number of commercially popular SMPs, such as Sun, DEC Alpha, SGI Origin, and Cray C90/J90, demonstrated excellent speedups [3, 6]. Pleasantly, we will show that this code is equally suitable for the modern multicore machines. The algorithmic features and parallelization techniques are outlined below (a small scheduling sketch is given after this list):
- Use panel-based left-looking factorization, with partial pivoting and possibly with diagonal preference to better preserve sparsity. Use a supernode-panel update kernel to effectively use Level 3 BLAS.
- Use an asynchronous and barrier-free dynamic algorithm to schedule both coarse-grain and fine-grain parallel tasks to achieve a high level of concurrency. A globally shared task queue is used to store the ready panels in the column elimination tree, and whenever a thread becomes free, it obtains a ready panel from the task queue. The coarse-grain task is to factorize the independent panels in disjoint subtrees, while the fine-grain task is to update panels by previously computed supernodes. The scheduler facilitates a smooth transition between the two types of tasks and maintains load balance dynamically.
Figure 2 illustrates the left-looking factorization scheme and the dynamic scheduling method using the elimination tree.
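To make the scheduling idea concrete, here is a minimal Pthreads sketch of a worker loop pulling ready panels from a shared, mutex-protected queue. The queue type, function names, and termination test are ours for illustration; the actual SuperLU MT scheduler is more elaborate (it interleaves the fine-grain updates with panel factorization and marks panels ready via the column elimination tree).

    #include <pthread.h>

    /* Hypothetical ready-panel queue protected by a mutex; the real
       SuperLU_MT scheduler tracks readiness via the elimination tree. */
    typedef struct {
        int *panels;               /* indices of panels ready to factor */
        int  head, tail;
        pthread_mutex_t lock;
    } panel_queue_t;

    static int pop_ready_panel(panel_queue_t *q)
    {
        int p = -1;
        pthread_mutex_lock(&q->lock);
        if (q->head < q->tail) p = q->panels[q->head++];
        pthread_mutex_unlock(&q->lock);
        return p;                  /* -1 means no panel is currently ready */
    }

    /* Each worker repeatedly grabs a ready panel, applies the updates from
       previously computed supernodes (fine-grain task), factorizes the
       panel, and marks newly ready panels (coarse-grain tasks come from
       independent subtrees). */
    static void *worker(void *arg)
    {
        panel_queue_t *q = (panel_queue_t *) arg;
        int p;
        while ((p = pop_ready_panel(q)) >= 0) {
            /* update_panel(p); factor_panel(p); enqueue_ready_children(q, p); */
        }
        return NULL;
    }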
Factorization in SuperLU DIST using MPI
The target platforms of this code are massively parallel distributed-memory computers [8]. Previously, we tested this code on a number of HPC platforms, including the Cray T3E, the IBM SP, and various Linux clusters. Good scalability was demonstrated up to one thousand processors. In order to address scalability issues, the parallel algorithm differs significantly from that in SuperLU MT. The main differences are in the pivoting strategy and the matrix distribution, which are summarized below.
- Use block-based right-looking factorization, which provides an abundance of parallelism during the block outer-product updates to the trailing submatrices. Following the supernode partition, perform a two-dimensional (nonuniform) block-cyclic matrix-to-processor mapping (a mapping sketch is given after this list). Use the elimination DAGs to identify task and block dependencies, and a look-ahead mechanism to better overlap communication with computation and shorten the critical path.
- Before factorization, pre-permute the rows of the matrix so that the diagonal has entries of large magnitude, using a weighted bipartite matching algorithm from MC64 [5]. During factorization, allow single-precision perturbations to the small diagonal entries.
Figure 3 illustrates the 2D block-cyclic partition and distribution for a sparse matrix.
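For illustration, under a standard 2D block-cyclic layout on a Pr x Pc process grid, the owner of block (I, J) can be computed as in the sketch below. The variable names are ours, not SuperLU DIST's; the actual supernode-based blocks are of nonuniform size, but the cyclic assignment by block index is the same idea.

    /* Owner of block (I, J) in a 2D block-cyclic layout on a Pr-by-Pc
       process grid, with ranks laid out row-major: rank = row * Pc + col. */
    static int block_owner(int I, int J, int Pr, int Pc)
    {
        int prow = I % Pr;   /* process row owning block row I       */
        int pcol = J % Pc;   /* process column owning block column J */
        return prow * Pc + pcol;
    }

For example, on a 2 x 4 grid, block (5, 3) is owned by rank (5 mod 2) * 4 + (3 mod 4) = 7.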
A note on SuperLU MT: symmetric mode
In order to conduct a direct comparison between SuperLU MT and SuperLU DIST, we have added a new algorithmic option to SuperLU MT, called the symmetric mode. As is known, with partial pivoting (SuperLU MT) it is better to use an A^T A-based column ordering strategy to preserve sparsity, since, pattern-wise, the Cholesky factor of A^T A upper bounds the LU factors in the decomposition P A = LU for any row permutation P. However, in the case of static pivoting where pivots are selected beforehand (SuperLU DIST), no row interchanges are made during factorization, and the A^T A-based upper bound becomes too loose. We can therefore use a tighter upper bound based on A + A^T. The difference in the amount of fill between an A^T A-based sparsity ordering and an (A + A^T)-based one can be more than a factor of two. The new symmetric mode option in SuperLU MT uses an algorithm similar to the one in SuperLU DIST: it uses MC64 to perform static numerical pivoting, an (A + A^T)-based symmetric sparsity ordering, and single-precision diagonal perturbations when needed.
All our experimental results used the symmetric mode in SuperLU MT. Thus, the amount of fill and number of floating-point operations are roughly the same with both solvers.
Triangular solution in SuperLU DIST
The triangular solution phase in SuperLU MT is not yet parallelized, therefore we evaluate only the parallel algorithm in SuperLU DIST. In Lx = b, where L is a lower triangular matrix, the i-th solution component is computed as

    x_i = ( b_i - sum_{j < i} L_ij * x_j ) / L_ii .
Therefore, the computation of x_i needs some or all of the previous solution components x_j, j < i, depending on the sparsity pattern of the i-th row of L. This sequentiality often poses a scaling hurdle for a parallel algorithm. Another hurdle to achieving good performance is the much lower arithmetic density, as measured by flops per byte of DRAM access or communication, compared to factorization. In the current implementation of SuperLU DIST, the parallel triangular solution uses the same 2D block-cyclic distribution as the factorization phase. Figure 4 illustrates such a distribution and the solution process. The processes owning the diagonal blocks (called diagonal processes) are responsible for computing the corresponding blocks of the x components. When x_j is needed in L_ij · x_j, and the owners of x_j and L_ij are different, x_j's processor needs to send it to the processor of L_ij; see 1 in Figure 4. In case 2, no communication is needed because both x_j and L_ij reside on the same processor, i.e., processor 1. After receiving the needed x_j entries, each processor proceeds with its local summation, i.e., step 3 in Figure 4. Finally, the local sums are sent to the diagonal processor, which performs the division; see 4 in the figure.
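To make the dependency on earlier solution components concrete, the following is a minimal sequential sketch of sparse forward substitution, assuming L is stored column by column with the diagonal entry of each column first. This is purely illustrative and is not the distributed data structure used in SuperLU DIST.

    /* Solve L x = b for lower triangular L in compressed sparse column form:
       column j holds entries val[colptr[j] .. colptr[j+1]-1] with row indices
       rowind[...], and the diagonal entry L(j,j) is stored first in column j. */
    void lower_solve_csc(int n, const int *colptr, const int *rowind,
                         const double *val, const double *b, double *x)
    {
        int i, j, p;
        for (i = 0; i < n; i++) x[i] = b[i];
        for (j = 0; j < n; j++) {
            x[j] /= val[colptr[j]];              /* divide by L(j,j)            */
            for (p = colptr[j] + 1; p < colptr[j+1]; p++)
                x[rowind[p]] -= val[p] * x[j];   /* update later components x_i */
        }
    }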
Entire solvers
Since the numerical pivoting methods differ in the two solvers (partial pivoting in SuperLU MT and static pivoting in SuperLU DIST), the high-level structures of the two codes are different. With partial pivoting, fill-ins are generated dynamically, so the symbolic factorization step cannot be separated from the numerical factorization. With static pivoting, the symbolic and numerical factorization steps can be separated. Figure 5 summarizes the major steps of the two solvers and highlights the differences between them.

Experimental results

Table 2 presents the characteristics of our benchmark matrices, which are available from the University of Florida Sparse Matrix Collection [2]. In the following subsections, we present the parallel runtimes and analysis. We benchmarked the Pthreads version of SuperLU MT, and SuperLU DIST using MPICH [9].
Table 2. Properties of the test matrices. A minimum degree ordering was applied to the structure of |A| + |A|^T. "Fill-ratio" denotes the ratio of the number of nonzeros in L + U to that in A; "Mean S-node" refers to the average number of columns in a supernode.
Characterization from hardware performance counters
Given that memory system performance plays an increasingly significant role on CMP architectures, especially for sparse matrix algorithms, it is more relevant to quantify the memory access patterns of the different algorithms than to count flops alone. For simpler kernels such as SpMV, the counting can be done manually. For complex codes, we need to resort to performance analysis tools.
The first tool we used is PAPI [10], which provides an API to access a machine's hardware counters. As a first cut, we examine the load and store instruction counts, which are independent of the cache memory organization. Table 3 compares the counts of the left-looking (SuperLU MT) and right-looking (SuperLU DIST) factorization algorithms. It is clear that the right-looking algorithm incurs many more load and store instructions, typically an order of magnitude more.

Although the load/store instruction counts indicate the superiority of the left-looking algorithm in terms of the program's static behavior, we are also interested in the temporal behavior of the codes while running on an actual machine. Unfortunately, the two multicore machines do not yet have proper PAPI support for such a study, so we used the CrayPat performance tool provided on the Cray XT systems [1]. CrayPat uses PAPI's counters to collect raw data, and then computes a variety of derived quantities that are easy to understand. The machine we used is the Cray XT4 installed at NERSC. Each node consists of a 2.6 GHz dual-core AMD Opteron processor, sharing 4 GBytes of memory. Each core has a 64 KB L1 data cache and a 1 MB L2 cache. The L2 cache is a victim cache which holds only the cache lines evicted from L1, whereas most data loaded from memory go directly to L1. We used only one core to run the codes and collected the data shown in Table 4; the data are for the entire solvers. We report two metrics: "Mem-to-D1" measures the amount of data transferred between memory and the L1 data cache, and "L2-to-Mem" measures the data traffic between L2 and memory. In both metrics, we see that SuperLU DIST requires a considerably larger amount of data transfer.

Lastly, we examine the flop-to-load (store) ratio in the triangular solution phase. We compare this metric for the two distinct stages of the solver: one is "ordering + factor" and the other is "tri-solve". In Table 5, we report the respective ratios of flops to loads and flops to stores. As can be seen, in both metrics the triangular solution phase has a much smaller flop density, sometimes more than an order of magnitude lower than the other part of the solver.
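For reference, a minimal sketch of collecting load and store instruction counts with PAPI's preset events is shown below. It uses the (legacy) high-level counter interface; the availability of PAPI_LD_INS and PAPI_SR_INS depends on the platform, and error checking is omitted.

    #include <stdio.h>
    #include <papi.h>

    int main(void)
    {
        int       events[2] = { PAPI_LD_INS, PAPI_SR_INS };
        long long counts[2];

        PAPI_library_init(PAPI_VER_CURRENT);
        PAPI_start_counters(events, 2);   /* begin counting loads and stores */

        /* ... run the factorization code being measured here ... */

        PAPI_stop_counters(counts, 2);    /* read and stop the counters      */
        printf("loads = %lld, stores = %lld\n", counts[0], counts[1]);
        return 0;
    }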
Runtime
In the factorization codes of both solvers, the BLAS routines can take more than 30-40% of the time. Therefore the BLAS speed is a key performance bound. In sparse codes, the matrix size for BLAS calls is usually small. The kernel in SuperLU MT is "BLAS 2.5", where we perform multiple DGEMV calls with different vectors while keeping the matrix in cache; we therefore usually keep the matrix size bounded by 200 × 100. The kernel in SuperLU DIST is BLAS 3 (mostly DGEMM). In order to maintain good load balance, we use even smaller block sizes, such as 50 × 50. In Figure 6, we plot the performance (Gflops rate) of DGEMV and DGEMM on the Clovertown and the VictoriaFalls. In each case, we used the vendor's high-performance mathematical library: Intel's MKL and Sun's SunPerf, respectively.
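A minimal sketch of the "BLAS 2.5" idea follows: the same small supernode matrix is reused across several DGEMV calls, one per destination vector, so it stays in cache. A Fortran-style BLAS interface is assumed, and the routine name and argument layout below are illustrative rather than the actual SuperLU MT kernel.

    /* Fortran BLAS prototype: y := alpha*A*x + beta*y. */
    extern void dgemv_(const char *trans, const int *m, const int *n,
                       const double *alpha, const double *a, const int *lda,
                       const double *x, const int *incx,
                       const double *beta, double *y, const int *incy);

    /* Apply Y(:,j) := Y(:,j) - A * X(:,j) for nvec vectors, reusing the
       cached m-by-k matrix A across all DGEMV calls. */
    void supernode_panel_update(int m, int k, const double *A, int lda,
                                const double *X, double *Y, int nvec)
    {
        const double one = 1.0, minus_one = -1.0;
        const int    inc = 1;
        int j;
        for (j = 0; j < nvec; j++)
            dgemv_("N", &m, &k, &minus_one, A, &lda,
                   X + (long) j * k, &inc, &one, Y + (long) j * m, &inc);
    }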
Recall that each core of the VictoriaFalls is hardware-multithreaded. Without explicit parallelization (e.g., threading) at the software level, DGEMM achieves less than one-third of the peak performance; that is, a single-threaded program is incapable of fully utilizing the resources provided by a multithreading architecture. Table 6 shows the parallel factorization times of the two solvers on the Intel Clovertown. For a fair comparison, the time includes both symbolic and numerical factorization, because it is not possible to separate these two steps in SuperLU MT; see Figure 5.
First, we note that MPICH can be configured with either the ch_shmem device for shared-memory processors, or the ch_p4 device for communication through sockets on distributed-memory machines. We first used the default ch_p4 configuration on the Clovertown cluster, and found that the code slowed down significantly beyond two or four cores. After we switched to the ch_shmem setup, we obtained respectable speedups. Therefore, for a large distributed system comprising manycore chips, it is imperative to be able to configure MPICH in a hybrid device mode: ch_shmem within a socket and ch_p4 across sockets. Currently, such a hybrid mode is not available.
Second, we examine the single-core performance. We would expect SuperLU DIST to outperform SuperLU MT, because the former uses BLAS 3 whereas the latter uses only BLAS 2.5. We see that this is true only for two matrices, g7jac200 and stomach, which have relatively denser L and U factors (fill ratios over 40, see Table 2), and hence BLAS 3 plays a larger role. For the sparser problems, the algorithms are memory-bound. We believe the worse performance of SuperLU DIST is mainly due to the larger memory traffic of the right-looking algorithm, especially the larger number of memory write operations; see the measurements presented in Section 4.1.
Third, we examine the speedups of the two codes. The last column of Table 6 shows the speedup obtained when creating eight threads or MPI tasks. The best speedup is 4.3, which is less than what we observed on conventional SMPs [3]. After profiling the code, we found that the overhead of the scheduling algorithm using the shared task queue and the synchronization cost of the mutexes (locks) are quite small. Further study is needed to understand where the time goes.
In addition, SuperLU MT usually achieves a greater speedup than SuperLU DIST. This can be seen in the row "speedup ratio (MT/DIST)" associated with each matrix. In some cases, SuperLU MT achieves a factor of two more speedup than SuperLU DIST. Table 7 shows the parallel factorization times on the Sun VictoriaFalls. Recall that this system has 16 eight-way hardware-threaded cores, so altogether we can have up to 128 threads. The single-thread performance of SuperLU DIST is usually better than that of SuperLU MT. This is probably because the machine has a higher byte-per-flop ratio (see Table 1) than the Clovertown, and hence does not penalize an algorithm that demands high memory bandwidth, such as the right-looking algorithm in SuperLU DIST, as much.
However, the coarse-grain task parallelism supported by MPI programming does not match the fine-grain multithreading architecture: MPICH often crashes when more than 16 tasks are generated. The Pthreads programming is much more robust, and SuperLU MT can effectively use 64 threads. Similar to the Clovertown, SuperLU MT usually achieves a greater speedup than SuperLU DIST, in some cases by a factor of two (see the "speedup ratio (MT/DIST)" rows).
We see that SuperLU MT achieves nearly perfect speedups for the first four to eight threads. This may be related to the Solaris round-robin scheduling policy, which schedules across sockets first, then cores, then hardware threads [13]. With this order, the first few threads are spread across different sockets and do not have much memory bus contention.

We now evaluate the performance of the parallel triangular solution algorithm in SuperLU DIST. We compare the eight-core Clovertown with the eight-processor Power5 SMP node. The parallel runtimes are tabulated in Table 8. The columns labeled "Current" correspond to the current implementation, and the columns labeled "Improved" refer to the new implementation resulting from this study.
It is disappointing that, on the Clovertown, the current code runs much slower as more cores are involved. A similar trend was also observed on the VictoriaFalls. After profiling various parts of the code, we found that the slowdown is due to the many calls to MPI_Reduce; in fact, on eight cores, MPI_Reduce can take over 75% of the time. Consider one block row of the L matrix, as circled in Figure 4: the diagonal process 0 needs to know which off-diagonal processes (1 and 2) will have sum contributions to send to process 0. To compute this count, every process holds a 0/1 flag indicating whether it owns nonzero blocks in that block row. Then, all the processes in each process row perform an MPI_Reduce (by SUM) over the flags, with the diagonal process being the root. Overall, each block row corresponds to one such reduction operation.
The improvement we made is the following. Instead of performing many reductions of a single integer each, we allocate a flag array of integers whose size is the number of block rows owned by each process; each entry is the flag associated with one block row. Then all the processes in the respective process row perform only one reduction operation on this flag array. This greatly reduces the memory and communication latency costs. On the eight-core Clovertown, the improvement is significant, ranging from 6-fold to 9-fold. Even on a conventional SMP node, such as the eight-CPU Power5, we obtained a respectable improvement of 63% to 84%. A sketch of the change is given below.
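The sketch below assumes a per-process-row communicator row_comm and simplified bookkeeping (in the actual code the choice of roots and the mapping of block rows to array entries are more involved): the old scheme issued one MPI_Reduce per block row, while the improved scheme performs a single reduction over the whole flag array.

    #include <mpi.h>

    /* nbrow:        number of block rows this process participates in
       my_flag[i]:   1 if this process owns nonzero L blocks in block row i
       num_senders:  on the root, receives the number of contributing processes
       diag_rank:    rank of the diagonal (root) process within row_comm       */
    void count_contributors(int nbrow, const int *my_flag, int *num_senders,
                            int diag_rank, MPI_Comm row_comm)
    {
        /* Old scheme, one reduction of a single integer per block row:
           for (int i = 0; i < nbrow; i++)
               MPI_Reduce(&my_flag[i], &num_senders[i], 1, MPI_INT,
                          MPI_SUM, diag_rank, row_comm);                        */

        /* Improved scheme: one reduction over the entire flag array. */
        MPI_Reduce((void *) my_flag, num_senders, nbrow, MPI_INT,
                   MPI_SUM, diag_rank, row_comm);
    }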
Note that the Clovertown time still does not scale as well as the Power5 time. Further investigation is needed in the future. 
Final remarks
We performed a preliminary study of the SuperLU sparse direct solvers on representative multicore architectures. Using performance analysis tools such as PAPI and CrayPat, we gave quantitative measures of both static and temporal memory access behavior. We found that the left-looking factorization incurs much less memory traffic than the right-looking one; therefore, it performs better on CMP systems with limited memory bandwidth. We believe this performance characteristic very likely carries over to other right-looking algorithm variants, such as the multifrontal algorithm. We also quantified that the arithmetic density of the triangular solution algorithm can be over an order of magnitude lower than that of the preprocessing and factorization algorithms. The Pthreads code is usually more robust and delivers consistently better performance than the MPI code, particularly on a multicore+multithreading architecture such as the Sun VictoriaFalls. These findings suggest that it will be beneficial to use a hybrid programming model, to design hybrid algorithms, and to provide a hybrid device mode for MPICH.
In the future, we plan to continue using the performance tools to refine our understanding of multicore scaling, and find ways to enhance performance.
