In this paper we present a static scheduling algorithm for parallel sparse LU factorization with static pivoting. The algorithm is divided into mapping and scheduling phases, using the symmetric pruned graphs of L T and U to represent dependencies. The scheduling algorithm is designed for driving the parallel execution of the factorization on a distributed-memory architecture. Experimental results and comparisons with SuperLU DIST are reported after applying this algorithm on real world application matrices on an IBM SP RS/6000 distributed memory machine.
Introduction
To factorize unsymmetric and non-definite matrices, many solvers rely on partial pivoting to maintain numerical stability. In that case, the structures of the L and U factors depend both on the structure of the initial A and on the row interchanges induced by partial pivoting. Thus the L and U structures cannot be determined before the numerical computation of the factors. In particular, this implies that some dynamically changing data structures should be used, together with a dynamic scheduling algorithm that identifies data dependencies while the numerical computation is in progress. In a distributed memory environment this approach has been observed not to scale so well [1] .
A possible way to address the scalability issue is to first evaluate to what extent replacing the partial pivoting with other, more static techniques, can help maintain numerical stability. Li and Demmel show that such techniques are indeed possible [12] . Their proposed SuperLU DIST solver uses a 2D distribution of the sparse matrix on a 2D grid of processors and is intended for large-scale distributed-memory machines. This solver is highly parallel and its ability to scale proportionally with the matrix size has been tested on a set of large matrices from various application domains. For almost all the matrices, the 2D distribution led to a very good load balance. However, exceptions have been noted with some matrices, for which a significant load imbalance was observed on 64 processors. Experiments using the 64 processors of a CRAY T3E [11] showed that the time spent in communication or synchronization represents a significant percentage of the total execution time. Even for the matrices for which the algorithm scales well up to 128 processors, more than 50% of the factorization time was spent waiting to receive messages.
These experiments suggest that further refinements are needed in order to improve the parallel efficiency. Several possibilities exist, among which the most relevant are: use more accurate information about the dependencies between computations, develop more sophisticated functions to map blocks to processors, consider additional scheduling techniques overlapping communication and computation.
Several mapping and scheduling algorithms have been proposed in the literature, mainly in the context of the Cholesky factorization. These algorithms use the elimination tree [14] to model the parallelism due to the sparsity of the matrix. An example of such an algorithm is the subtree to subcube mapping [7] which leads to good performance if the elimination tree is balanced. A generalization of this algorithm for unbalanced elimination tree is the proportional mapping algorithm, proposed by Pothen and Sun [15] . These algorithms map the columns to processors during a top-down traversal of the elimination tree, balancing the load while minimizing the communication.
Starting from these algorithms, a static scheduling algorithm was developed in the PaStiX solver [9] . This algorithm is based on two distinct phases. The first phase (partitioning phase) assigns to each supernode j a set of candidate processors. They are obtained by using the proportional mapping algorithm of Pothen and Sun [15] . The second phase (mapping phase) simulates the parallel factorization to order the tasks associated with the computation of each supernode. These tasks are assigned to a subset of processors from the set of candidate processors. Thus, the computation of each supernode will be assigned to only one processor (in 1D distribution) or several processors (in 2D distribution), depending on the workload associated to this supernode and the number of processors in the set of candidate processors.
In this paper we present a static scheduling algorithm for the sparse LU factorization. The main goal is to reduce as much as possible the communication, while balancing the load. Similar to the PaStiX solver, the algorithm uses the first phase to map the supernodes to processors, followed by the second phase to schedule the tasks. Our contributions are as follows. In contrast to most of the existing algorithms which use a tree to represent the dependencies between tasks, we use the symmetric pruned graphs of L T and U to represent accurate dependencies between tasks. For the first phase, we develop a generalization of the proportional mapping, and the resulting algorithm represent one of the main contributions of the paper. For the second phase, we use a list scheduling heuristic. We show that this scheduling algorithm is especially effective to improve the scalability on large number of processors for very sparse matrices. Compared with the 2D mapping, the number of messages is significantly reduced from between 9023 and 50682 to between 615 and 3856 on 64 processors. Moreover, the proposed mapping and scheduling algorithm is fast, with its time complexity linear in the size of the input DAG, i.e., the symmetric pruned graph of U .
The rest of the paper is organized as follows: Section 2 introduces the dependency graphs for the LU factorization with static pivoting, and describes the scheduling algorithm for the graph to minimize the execution time. Experimental results and comparisons with SuperLU DIST solver are presented in Section 3, followed by the conclusions and future work in Section 4.
The mapping and scheduling algorithm
Let a square matrix be partitioned into blocks of submatrices. This is usually obtained by partitioning the columns using the unsymmetric supernodes (columns of L with the same nonzero structure [2] ). After that, the same partitioning is applied to the rows of the matrix to further break each supernode into blocks of submatrices. We denote by U kj (L kj ) a submatrix of U (L) at row block index k and column block index j. For each column block, we identify two types of tasks. Task Factor(k) factorizes the column block k and exists for each 1 ≤ k ≤ N . Task Update(j, k) updates block column k by block column j and exists for j < k and U jk = 0. 1 The sparse LU factorization algorithm can be described as:
We use two phases to distribute the data and to schedule the computations, as in the PaStiX solver. In the first phase we assign a set of candidate processors to the computation of each supernode. In the second phase we schedule the computations in order to minimize the execution time. During the second phase, the computation of each supernode is scheduled on a processor from its candidate processors set.
The two phases of the scheduling algorithm use the dependencies between tasks. In the case of an unsymmetric matrix A, several tools can be used to represent the dependencies:
• the elimination tree of A + A T [14] ;
• the elimination DAGs of L T and U [8] ;
• the symmetric pruned graphs of L T and U [4, 5] .
The elimination tree of A + A T is a structure offering simple manipulation and access. This tree can be used to represent the dependencies between supernodes, but it overestimates these dependencies. The elimination DAGs of L and U are the transitive reductions of the graphs of L and U , and are a compact way of representing all dependencies between computations, and only those dependencies. But a major disadvantage is their significant amount of construction time.
A good tradeoff is to use the symmetric pruned graphs of L T and U . These graphs can be built very efficiently [4, 5] . Even if they introduce redundant dependencies compared to the elimination DAGs, we experimentally observed that they contain few redundant edges. Thus, they can be used effectively. However, due to their simplicity, we still use the elimination DAGs to develop the theoretical results. Knowing that all these results are also valid if used with the symmetric pruned graphs, our experimental results are based on the symmetric pruned graphs of L T and U .
Finding a set of candidate processors
The goal of the first phase is to assign a set of processors to each supernode, such that these processors can participate efficiently in the factorization of the supernode.
For symmetric matrices, the proportional mapping algorithm [15] attempts to map dependent computations on the same processor while balancing the load. This algorithm uses the elimination tree to represent the dependencies between supernodes. It starts by assigning all processors to the root. Then it considers the subtrees of the root in descending workload order. It assigns to each subtree a subset of processors proportional to the workload of the subtree. The algorithm continues until only one processor is assigned to each subtree. The advantage of this algorithm is that the communication in a subtree is restricted to only between the processors assigned to this subtree, thus ensuring a low communication cost.
For unsymmetric matrices, we are interested in using more accurate information on the dependencies between supernodes, and in this case the elimination DAG of U is helpful. This led us to consider a generalization of the proportional mapping algorithm to the DAGs. At each step of the algorithm, we consider a node i of the graph and its set of processors. Our goal is to assign to each predecessor j of node i a subset of i's processors, such that the communication between node i and its predecessors is minimized.
When the proportional mapping algorithm is applied to the elimination tree, each node j has only one successor i, and assigning to node j a subset of processors from the set of processors of i will reduce the communication. In the case of the elimination DAG, a node j has several successors, and it can communicate data to all these successors. The basic idea of our approach is to assign to node j several processors from its successors. We assign more processors from the successors with whom node j communicates more. To do so, we include a proportion of the workload associated with j to the workload associated with each of its successor i. This proportion depends on the communication between j and i.
Since we only want to consider communication between j and its immediate successors, a new problem arises. Some of the edge j → k in the graph of U disappears and occurs as a path of length greater than one in the elimination DAG (e.g., j → i → k). So node j needs to update its immediate sucessors and some of its ancestors in the elimination DAG. How do we estimate the communication between node j and its immediate successor i in the elimination DAG? We propose to approximate this as follows. We first compute total communication volume to node i, including all the incoming edges to i in the graph of U . We then divide this amount equally among i's immediate predecessors in the elimination DAG. In other words, all the communication from i's descendents are treated as the communication from i's immediate predecessors, so that the global communication information is retained even in the presence of graph contraction.
Algorithm 1 considers the floating-point operations to factorize a supernode (F SN), the number of predecessors of each supernode in the graph (noP red), the volume of communication involved in the factorization of supernode i (avgCommIn[i]), and the volume of communication from supernode j to its successors (commOut[j]). Using this information, the algorithm computes the workload associated with each supernode (W SN), taking into account the workload of its predecessors and the communication with its predecessors.
After computing the workload associated with each supernode, we can compute the set of candidate processors using Algorithm 2. This algorithm takes as input the dependency graph G, annotated with the workload and communication, and returns for each supernode 
Algorithm 1 Compute the communication and the workload of each supernode
; end for end for a set of candidate processors. We add to the dependency graph G a sink supernode N + 1, which is used to start the algorithm.
Algorithm 2 is a generalization of the proportional mapping algorithm of Pothen and Sun [15] . The main difference is that for each node i, it considers a proportion of the workload of each predecessor to assign a subset of its processors, taking into account the volume of communication. Note that if the input graph G is a tree, the algorithm is identical to the proportional mapping algorithm. If G is a tree and i is the parent of j in this tree, then commOut[j] will be equal to avgCommIn [i] . Thus the workload associated with supernode i, denoted as W SN [i] , is equal to the sum of the workloads associated with all the nodes belonging to the subtree rooted at i, similar to the proportional mapping algorithm of Pothen and Sun. Now we illustrate the execution of this algorithm on an example matrix B in Figure 1 , where each column corresponds to a supernode and each nonzero element corresponds to a block. with the subtree rooted at supernode i, which is the sum of the workloads of all the nodes belonging to the subtree rooted at i. At the left of each node i, we represent the set of candidate processors, which later can take part in the computation of supernode i. For example, node 5 disposes of 4 processors, namely P 0 , P 1 , P 2 , P 3 . This node has two sons, 3 and 4. The load of the subtree rooted at 3 represents 3/4 of the total load of the subtree rooted at 5, while the load of the subtree rooted at 4 represents 1/4 of this load. Thus, node 3 will receive three processors P 0 , P 1 , P 2 , while node 4 will receive the processor P 3 .
Second, we illustrate the execution of our proportional mapping algorithm on the same example matrix B in Figure 1 . The elimination DAG of U is shown in Figure 3 . Again, the nodes of this graph represent supernodes of B. In Figure 3 we label at the right of each node its corresponding values in the arrays W SN and F SN. These values are used in Algorithm 2. Moreover, avgCommIn[i] is labeled on each edge to i, thus the value commOut[j] is the sum of the outgoing edges from j. As an example, the value of the load W SN associated with node 8 is obtained by doing the sum 50 + 200 + (300 * 6/9) + (250 * 6/30) = 500.
In Figure 4 we present the set of candidate processors assigned to each node by Algorithm 2. In this figure, node 10 was added as a starting point for the execution of the algorithm. Consider node 2. This node communicates mainly with its successor 3, thus it receives a processor from node 3 and no processor from node 5. Node 3 communicates roughly the same amount of data to its two successors, and thus it receives a processor from each one of its successors 8 and 9.
By comparing the results of the executions when using the elimination tree of B + B T P P WSN 80 P P P P P P P P P P P P P P 0 1 P P P P 0 1 2 3 P P P P ( Figure 2 ) and the elimination DAG of U (Figure 4 ), we notice that our approach using the elimination DAG tries to minimize the communication by considering the dependencies between supernodes and their communication. Node 3 in Figure 2 has received a subset of processors from node 5, even if there is no dependency between supernodes 5 and 3. This is due to the fact that the elimination tree of A + A T introduces false dependencies between supernodes.
Scheduling the computations
After the phase for mapping a supernode to a set of candidate processors, we perform scheduling. We use a list-based scheduling heuristic [6, 16] in this phase, which assigns the tasks associated with the computation of each supernode to a processor from the set of candidate processors. A left-looking approach with a 1D distribution of the data is used: the computaion of a supernode k is assigned to one processor; the updates to supernode k (Update(*, k)) are executed just before the factorisation of this supernode (Factor(k)).
With this approach, the elimination DAG of U gives all the dependencies between supernodes computations, and hence it can be used as a dependency graph in the list scheduling algorithm. Each node of this graph corresponds to a supernode. An entry supernode is defined as a supernode with no incoming edges, while an exit supernode is defined as a supernode with no outgoing edges. During the scheduling, a supernode becomes ready when all its predecessors were scheduled.
Each supernode has a computation cost, which is the number of floating-point operations performed for this supernode. Each edge is assigned a communication cost, which is equal to the volume of data transferred between its two end supernodes. Using this information, we assign the priority of each supernode as the longest path from this supernode to an exit supernode. The length of the path is the sum of the communication and the computation costs of its constituent edges and supernodes.
For each processor we maintain a list of ready supernodes and the processor's start time. At each iteration of the scheduling loop, we select the processor with the earliest start time. From its list of ready supernodes, the supernode with the highest priority is selected (say supernode k), and needs to be assigned to one of its candidate processors that can execute it at the earliest time. Thus, we attempt to assign the supernode to each processor in its set of candidate processors, and determine on which processor the computation of this supernode can have the earliest start time. The supernode is then mapped on the selected processor and the tasks associated with its computation (tasks Update(*, k) and Factor(k) can be scheduled as soon as the computation of the last supernode assigned to that processor is finished. To overlap computations and communications, we consider a valid order between tasks Update(*, k), and we schedule them in the order in which the data needed from its predecessors has arrived locally; we discuss the valid order further in this section.
The processor start time is updated as the finish time of the supernode scheduled on that processor. If a successor of the current supernode becomes ready for execution, then it is added to the ready list of every processor in its candidates set. The scheduling loop is repeated as long as there exist unscheduled supernodes.
A valid order between tasks Update(j, k) is given by the ascending order of the indices j. Using the elimination DAG of L T , we can obtain a valid order which uses more accurate information on the dependencies between these tasks.
The following lemma gives the dependency between the updates from two supernodes i, i to supernode k, where i, i are linked by a path in the elimination DAG of L T .
Lemma 1 Consider supernodes i, i and k such that i is the successor of i in the elimination DAG of L T and tasks Update(i, k), Update(i', k) exist. Then task Update(i, k) has to be completed before task Update(i', k) can start its execution.
Proof As i is the successor of i in the elimination DAG of L T , then block L i i is nonzero. As tasks Update(i, k), Update(i', k) exist, we can deduce that blocks U ik , U i ,k are also nonzero.
Task Update(i', k) can begin its execution as soon as all the updates to block U i k are finished. As task Update(i, k) is one of these updates, then this task must modify block U i k before task Update(i', k) can start its execution. 2
Let i, i be two supernodes such that i < i . If there is no path from i to i in the elimination DAG of L T , then block L i i is zero. Consider another node k such that tasks Update(i, k), Update(i', k) exist. Then there is no dependency between the two tasks, as task Update(i, k) does not modify block U i k necessary for the execution of task Update(i', k).
From these results, the dependencies can be defined as follows :
• There is a task Update(i, k) for each U ik = 0 and 1 ≤ i < k ≤ N .
• There is a dependency from Update(i, k) to Update(i', k) if i is the successor of i in the elimination DAG of L T .
• There is a dependency from Update(i, k) to Factor(k) if i is an exit node, or the smallest successor j of i in the elimination DAG of L T is no smaller than k.
We illustrate in Figure 5 these rules by computing the dependencies between the tasks associated with supernode 8 for the matrix in Figure 1 . Before factorizing this supernode (task Factor (8) in the dependency graph), supernode 8 is updated by supernodes 1, 2, 3, 5 and 7. A valid order of these updates is given by the ascending order of the source supernodes in the updates 1, 2, 3, 5, 7. However, an order exhibiting more parallelism is the order using the elimination DAG. Hence, there is a dependency from Update (1, 8) to Update (3, 8) because 3 is the successor of 1 in the elimination DAG of L T (the elimination DAG of L T is presented in Figure 6 ). But there is no dependency between Update (2, 8) and Update (3, 8) because there is no path from 2 to 3 in the elimination DAG of L T .
Update (1, 8) Update (3, 8) Update (2, 8) Update (5,8) Update (7, 8) Factor(8) Figure 1 
Experimental results
In this section, we present the experimental results obtained when applying the new scheduling techniques on the real world matrices. We tested the new factorization method on an IBM SP RS/6000 distributed memory machine at NERSC. pute processors distributed among 184 compute nodes. Each processor is clocked at 375 Mhz and has a peak performance of 1.5 GFlops. Each node has 16 to 64 Gbytes of shared memory. We used several medium and large matrices from a variety of application domains. These matrices and their characteristics are presented in table 1, which includes the matrix order, the number of nonzeros in the matrix A, the number of nonzeros in the factors L and U , and the number of floating-point operations. Table 2 presents the size of the supernodal graph and its symmetric pruned graph. The second and the third columns list the number of nodes and edges in the supernodal graph of U . The fourth column lists the number of edges in the symmetric pruned graph of U . The fifth and the sixth columns list the number of entry and exit supernodes in the symmetric pruned graph of U . For all our test matrices, the supernodal symmetric pruned graph of U is much smaller than the supernodal graph of U . Very often, there are one order of magnitude fewer edges than in the supernodal graph of U . Three of the test matrices (ex11, venkat01 and wang4) are structurally symmetric, in which case the pruned graph is a tree. Computing the symmetric pruned graph takes very little time, and this time is included in the scheduling time overhead reported in Figure 7 and Table 5 .
We now compare the performance of the new factorization algorithm (referred as SCHED) to the factorization algorithm in SuperLU DIST (referred as SLUD). In particular, we compare the load balance, the amount of communication and the runtime. Table 3 : Load balance results.
For both algorithms, the preprocessing steps are the same. These include a step to permute large entries on the diagonal (using the routine MC64 [3] ), followed by a symetric permutation to preserve the sparsity (using multiple minimum degree algorithm applied on A + A T [13] ) and the symbolic factorization to get the structures of L and U . Only the numerical factorization phase is different in the two approaches. This includes the matrix distribution and the actual factorization. After the preprocessing steps, SLUD distributes the data among processors using a 2D block-cyclic distribution on a 2D grid of processors. Locally on the set of owner processors, each supernode of L is stored in a column oriented format, while each supernode of U is stored in a row oriented format. This storage scheme fits well the right-looking factorization. In SCHED, a master processor executes the scheduling algorithm, and then sends the necessary information to all the other processors to guide the numerical factorization. Each supernode is distributed on its owner processor, and is stored using a column oriented format, for both L and U . This storage is well adapted to the left-looking factorization. After the distribution, the numerical factotorization is performed using the valid task order established by the scheduling algorithm.
To evaluate the load balance, we consider the load associated with a processor as being the number of floating-point operations performed on this processor. As described in [11] , the load balance factor can be computed as the average load divided by the maximum load among all the processors. Thus, the closer is this factor to 1, the better is the load balance. Table 3 shows the load balance factors. Compared with the 2D block-cyclic mapping, the proportional mapping algorithm usually improves load balance, with very few exceptions. Table 4 compares the amount of communication of the two algorithms. For each matrix, we report the average communication volume and the average number of messages per processor. For all the test matrices, SCHED leads to a large reduction in the number of messages. Usually the average number of messages increases with the increasing number of processors up to 16 or 32, and then it starts decreasing. On the other hand, the volume of communication for SCHED is not always smaller than for SLUD. Sometimes SCHED is better and sometimes SLUD is better. But the difference is not dramatic. The worst P = 4 P = 16 P = 32 P = 64 P = 128 af23560 SCHED Vol case is matrix wang4, for which the SCHED's communication volume is twice more than that of SLUD. These results imply that in SCHED, the message size is usually much bigger than that in SLUD. This is mainly because using the 1D distribution in SCHED, a message contains an entire supernode k of L. Whereas in SLUD, a message contains only a part of supernode k of L. Therefore, it is important to overlap computation with communication so to avoid the idle time waiting for the messages. This is well addressed in the new scheduling algorithm. Finally, we compare the actual runtimes in Figure 7 . (The runtimes are also tabulated in table 5.) Each plot in the figure corresponds to one matrix with varying number of processors. Since the two algorithms differ in matrix distribution and numerical factorization, we separately report the distribution time (labeled "dist") and the factorization time (labeled "fact"). We also report the total time which is the sum of the two. SCHED also needs to pay a small cost of scheduling overhead. We report this separately (labeled "schedule"). The reason we do not include this in the total time for SCHED is that when the matrices of the same nonzero structure are factorized multiple times, the scheduling algorithm will only be invoked once, and hence the cost is very small.
On smaller number of processors (less than 16), the distribution time for SCHED can be drastically smaller than that for SLUD, such as matrices af23560, onetone1 and onetone2. This is due to the fact that storing both supernodes of L and U in a column oriented format can lead to a more efficient distribution algorithm. But with increasing number of processors, the distribution time for SCHED increases, while for SLUD it decreases. On one processor, the time difference is only in the distribution step. The factorization speed is about the same for both codes.
When comparing the total time of both distribution and factorization, SCHED is faster than SLUD for 6 matrices; it is more than twice faster for matrix venkat01. We observed more improvement on a large number of processors, where the total number of messages usually increases. In this case it is even more important to reduce the number of messages, and thus SCHED approach is effective. The time continues to decrease when increasing the number of processors up to 64. Beyond 64 processors, only the time for bbmat continues to decrease. This implies that this set of test matrices is not large enough to demonstrate scalability of the algorithms. In the future, we will test larger matrices.
For ex11 and bbmat, SLUD is faster than SCHED. These two matrices are relatively denser than the other matrices. We suspect that these matrices exhibit limited amount of parallelism to be exploited in a left-looking algorithm with a 1D distribution. The 2D block-cyclic distribution used by SLUD can exploit more parallelism. As part of the future work, we will study the performance impact on denser matrices when using a 1D partition or a 2D partition (the 2D partition was shown to be more scalable for dense matrices). We will also evaluate the paralelism available in a left-looking algorithm versus a right-looking algorithm.
Conclusions and future work
In this paper we present a new assignment and static scheduling algorithm for sparse LU factorization with static pivoting. This algorithm uses the symmetric pruned graphs of L T and U to represent the dependencies between computations, thus exploiting the parallelism due to the sparsity and asymmetry of the matrix. Experimental results show that our approach leads to a large reduction in the number of messages, and for very sparse matrices the performance compares favorably to that of the SuperLU DIST solver on the IBM SP RS/6000 machine. Furthermore, the proposed scheduling algorithm is easy to implement and is fast, with its time complexity linear in the size of the input DAG, i.e., the symmetric pruned graph of U .
In an earlier work comparing SuperLU DIST and MUMPS [1] , it was found that MUMPS is faster for smaller numbers of processors (e.g., up to 64 on a Cray T3E), but SuperLU DIST is faster for larger numbers of processors and shows better scalability. The new factorization algorithm SCHED in this paper usually performs better than SuperLU DIST, especially for sparser matrices and larger machines. Therefore, it compares favorably with MUMPS for large numbers of processors.
Future work remains to improve the performance of the new approach and several avenues can be explored. A more accurate performance model should be developped in order to effectively use the list scheduling algorithm and to reduce the processor's idle time. More optimizations can be done to better overlap computation and communication. Methods for Table 5 : Schedule time in seconds (schedule), total numerical factorization time in seconds (including data distribution time) on the IBM SP RS/6000.
controlling the memory requirement on each processor will be analyzed and implemented, which can improve the memory usage of the left-looking scheme.
To speed up the numerical factorization for denser matrices, we plan to extend our methods so that both 1D and 2D distributions will be used. During the list scheduling algorithm, the computation of each supernode will be assigned to one processor (in 1D distribution) or several processors (in 2D distribution), depending on the workload associated to this supernode and the number of processors in the set of candidate processors. Thus both task and data parallelism will be exploited in the program.
One final remark is about the choice of the algorithm -when to use SCHED and when to use SLUD, since both have merits. As we mentioned earlier, we think SCHED performs better for sparser problems. So we sort the matrices in terms of density, which is defined as nnz(L + U )/n 2 , and plot the performance gain of SCHED over SLUD in figure 8 . If our conjecture was true, each line would increase monotonously with increasing density. We somewhat see this trend, with some exceptions, such as matrices af23560 and wang4, corresponding to the two dips in the plots. These two matrices are relatively dense, but SCHED performs better. We admit that the performance gain is a complex function of the input matrix and the two different algorithms. More factors than just sparsity affect the performance. It remains an open question to predict performance of different algorithms from the input matrix in order to help choose the right algorithm. For this, we plan to include a much larger set of matrices and analyze the global trend. 
