Abstract-All many-core systems require fine-grained shared memory parallelism, however the most efficient way to extract such parallelism is far from trivial. Fine-grained parallel algorithms face various performance trade-offs related to tasking, accesses to global data-structures, and use of shared cache. While programming models provide high level abstractions, such as data and task parallelism, algorithmic choices still remain open on how to best implement irregular algorithms, such as sparse factorizations, while taking into account the trade-offs mentioned above. In this paper, we compare these performance trade-offs for task and data parallelism on different hardware architectures such as Intel Sandy Bridge, Intel Xeon Phi, and IBM Power8. We do this by comparing the scaling of a new taskparallel incomplete sparse Cholesky factorization called Tacho and a new data-parallel incomplete sparse LU factorization called Basker. Both solvers utilize Kokkos programming model and were developed within the ShyLU package of Trilinos. Using these two codes we demonstrate how high-level programming changes affect performance and overhead costs on multiple multi/manycore systems. We find that Kokkos is able to provide comparable performance with both parallel_for and task/futures on traditional x86 multicores. However, the choice of which highlevel abstraction to use on many-core systems depends on both the architectures and input matrices.
I. INTRODUCTION
A range of different many-core architectures currently exist. All these architectures require fine-grained shared-memory parallelism. Many programming models, such as OpenMP [1] , OpenCL [2] , Kokkos [3] , and Cilk [4] , target shared memory parallelism with the goal of portable performance. Ideally, algorithm developers would like one portable implementation that would scale well on most many-core systems. While the programming models provide high level abstractions such as parallel-for for a fork-join kind of "data-parallelism" and task/futures style abstractions that express dependencies between data or computational kernels for asynchronous "task-parallelism", achieving good parallel performance on various hardware architectures depends on several other factors; e.g., the overhead of scheduling, granularity of the kernels and the data they work with, how they fit in cache, and how much and how often data is shared globally between the kernels. These problems are exacerbated when the target application is very irregular in its access pattern, such as sparse matrix factorization, and the target architecture is not one platform but a variety of platforms. We address this problem in a systematic way by exploring these trade-offs using task and data parallelism on sparse incomplete factorization on Intel Sandy Bridge, Intel Xeon Phi, and IBM Power8.
We compare two newly developed solver codes called Tacho [5] and Basker [6] in the ShyLU [7] package of Trilinos. Tacho provides task-parallel incomplete sparse Cholesky factorization and Basker uses a data-parallel approach for sparse incomplete LU factorization. Both codes utilize the Kokkos shared memory model to allow execution on multiple many-core systems. The only selection made differently in Kokkos is the choice between using parallel_for or task/futures. An introduction to the algorithm choices in these solvers is presented in Section III. A detailed explanation of these algorithms is not in the scope of this paper. While it would be ideal to compare one factorization algorithm, say either LU or Cholesky, with two different approaches of parallelism it would lead to some redundant work in replicating the same functionality. Moreover, even a highlevel choice would require a great deal of code restructuring. Both these factorizations are complex enough to discourage such a rewrite. Instead, we use both packages on a set of input matrices that reduce the differences between algorithms. As a result the performance differences could be explained by the implementation choices. We also consider the fact that they do different number of floating point operations and adjust the metrics accordinlgy. We use these methods on different hardware architectures in order to demonstrate their performance characteristics, scaling behavior, and analyze how the trade-offs and algorithm choices at the higher level impact performance in these systems. In addition, we demonstrate that the trade-offs are not only architecture specific, but they are input-dependent as well with performance comparison for different sparse matrices. This variation is often missed in other benchmark applications considering a single input.
The rest of the paper is organized as follows. We first provide a brief background on task and data parallelism, Kokkos, and the test bed systems used in Section II. Section III provides an overview of the algorithm choices of Tacho and Basker. Next, a general performance study is given to demonstrate similarity and differences of the two codes on varies systems. An analysis of design choices is given to provide insight into the differences demonstrated in the general performance study in Section IV. We finally review related work in Section V, and provide conclusions of this work in Section VI.
II. BACKGROUND
Task-parallelism has been successfully used for irregular problems such as sparse direct factorizations [8] , [9] , [10] , [11] and graph algorithms [12] . Task-parallelism allows for a (central/distributed) tasking system to keep threads busy, and is ideal when tasks may have slight imbalance due to sparsity. However, a central tasking system comes with an overhead from threads accessing the work queue. While a distributed queueing system alleviates this problem, there is still measurable overhead from the tasking system. NUMAaware [12] , [13] , [14] and work stealing [4] based approaches have been shown to make sufficient improvement to tasking for irregular problems. Even with these improvements dataparallelism is the natural way to express the parallelism for very regular problems with the least overhead. Our problem of choice, however, is highly irregular making it more attractive for task parallelism.
Data-parallelism on many-core processors requires dividing a problem in a traditional SIMD pattern. This tends to lead to a static assignment or somewhat limited dynamic assignment of problem to threads leading to very little overhead costs. Packages, such as Cilk [4] and Kokkos [3] , provide an interface that allows programming with data parallelism, even in a hierarchical manner, on many-core systems. These hierarchies can allow for efficient use of multiple light-weight threads per core and synchronizations between subsets of threads called thread teams. In terms of irregular problems, it has been shown that shared memory programming models improve the performance of highly irregular graph algorithms both in multicore [15] and many-core systems [16] . However, sparse factorizations have very different data access patterns. For example, while it is natural to express the irregular graph algorithm as an edge based algorithm the dependency in sparse factorizations cannot be easily expressed at the non-zero level.
There are not many shared memory incomplete sparse factorization libraries. One reason is the difficulty to develop a method that can scale well. The other reason being the incomplete factorization is typically computed within an MPI rank allowing users to use a sequential incomplete factorization. However, next generation systems will contain manycore processors with serial execution being slower and huge performance penalties, and thus requiring a parallel incomplete sparse factorization code. A distributed memory implementation of incomplete factorization exists in Hypre [17] and comparison against Tacho in [5] shows Tacho's highly optimized performance. Complete shared-memory sparse factorizations exist in SuperLU-MT [18] and Pardiso [19] .
Kokkos. Kokkos is a templated C++ library in Trilinos that allows users to write thread-scalable code for different architectures such as CPUs, Xeon Phis, and GPUs. It provides support for multiple backends, such as OpenMP, Pthreads, Qthreads, and CUDA. Kokkos was chosen as it allows for a performance portable method for our solvers with few changes. Currently, only data-parallelism is supported with all backends, which is used by Basker. Task-parallelism is being added to support execution from an implicit DAG that is scheduled by a runtime system on behalf of the application using Pthread and Qthread backends. For example, consider the problem of executing a binary dependency tree, i.e., P 0 ← P 2 → P 1 where P 0 and P 1 depend on P 2 . Algorithm 1 provides the pseudo-code of how this would be execute in a data parallel manner. In contrast, Algorithm 2 provides pseudo-code of how this can executed using task-parallelism in Kokkos. Tacho is was first designed to demonstrate the power of the new addition. Kokkos uses C++ functors to implement both data and task parallelism. Moreover, Kokkos transforms multidimensional arrays of data to fit the required system layout and promotes vectorization. Many-Core Systems. Current high performance computing systems contain heterogeneous many-core nodes. In this paper, we focus on three different multicore and many-core systems: traditional multicore architectures, i.e., Intel Sandy Bridge and IBM Power8 and a coprocessor architecture, Intel Xeon Phi (Knights Corner). We do not test the popular NVIDA GPU style nodes as incomplete sparse factorization is not well suited for this device, and most GPUs do not yet support tasking.
The Sandy Bridge system used for our test has two 8-core Xeon E5-2670 running at 2.6GHz. These two processors are interconnected using Intel's QuickPath Interconnect (QPI) to speedup cache hits in other processor's memory hierarchy. Three levels of cache exist with a private 256KB L2 and a large shared 20MB L3. We note that large SIMD units or high performance BLAS operations for which this type of processor is known for offers limit benefits to the very sparse operations in our applications. The Power8 system contains two sockets with each containing two Power8E processors with five cores running at 3.4GHz. Each socket has two non-uniform memory access (NUMA) regions with its own memory controller. This results in higher latency between processors within a socket and between sockets than local accesses. In each NUMA region, there exist three levels of cache with a private 512 KB L2 and the L3 being a collection of shared 5×8MB segments.
The Intel Xeon Phi coprocessor used in our tests is a Knight Corner with 57 cores running at 1.1GHz. We will only use up to 32 threads for Basker as it is limited to powers of two, and oversubscription of cores was found to harm performance in this application. Tacho will use all 57 cores for this evaluation. However, future work is being done to allow Tacho to use data-parallelism within a core in order to provide oversubscription in the future. The coprocessor has two levels of cache with the L2 being shared. However, due to its segmented layout like Power8, the shared L2 has varying latency depending on location. All these small changes in architectures, such as how the caches are laid out and access costs for global data structures between threads, play a big role in the performance of sparse factorizations. Understanding the trade-offs in relationship to high-level programming choices is the primary focus of this study.
In all our comparisons, we map one thread to a single core. The reason for this is that both task and data parallel implementations of sparse matrix factorization do not exploit additional nested parallelism which is important to effectively utilize hardware threads.
III. PARALLELIZATION STRATEGIES
In this section, we briefly introduce the parallelization strategies of two incomplete sparse factorization codes in the ShyLU [7] package of Trilinos [20] . Detailed explanations of Tacho [5] and Basker [6] can be found in their individual reports. Incomplete factorizations are commonly used as preconditioners in iterative linear solvers. Many types of incomplete factorization exist such as those based on fill level, tolerance, and multiple level of structures. We will only focus on fill level-based methods where the nonzero pattern is based on levels of fill-in, i.e., zero elements becoming nonzero during factorization, from the elimination tree. We will only compare with symmetric positive definite matrices, so that LU = LL T with no pivoting, and making Tacho and Basker similar.
A. Tacho: Task-parallel Cholesky
Our task parallel Cholesky factorization (A = LL T ), called Tacho [5] , is a right-looking algorithm based on algorithmsby-blocks style factorization [21] . This class of algorithms decomposes matrix factorization into a set of tasks with dependencies by viewing the matrix as a collection of blocks. This concept has been used in parallelizing out-of-core dense linear algebra libraries [22] , [23] .
These blocks are found in the following way. A sparse matrix is reordered by using nested dissection [24] to increase
Fig. 1. Tasks generated while proceeding Cholesky-by-blocks. A sparse matrix is organized by blocks and a member block itself is sparse.
the degree of concurrency, which is commonly done by other parallel factorization codes [19] . The nested dissection algorithm recursively bisects the adjacency graph of the matrix into subgraphs, which results in a tree hierarchy (See Figure 2(a) ). Based on the hierarchy, the matrix is reorganized as a matrix of two-dimensional blocks. A parallel symbolic factorization is performed using a breath-first search before numeric factorization in order discover the nonzero pattern using the elimination tree. During numeric factorization, Cholesky-by-blocks generates tasks with dependencies. Note that the resulting tasks are not simply based on the nested dissection tree. Instead, tasks are related by the algorithm-by-blocks with in/out data blocks, and the tasks are asynchronously scheduled by one thread using futures with Kokkos. By generating tasks using algorithmby-blocks more tasks can be generated and no thread will have to stall waiting for work. For example, Figure 2 (a) shows how few tasks would be generated by a one-dimensional layout from the nested-dissection tree. However, Figure 2 (b) shows only a fraction of the tasks that can be generated from algorithm-by-blocks. Once this thread is done scheduling it joins the other working threads to fulfill outstanding tasking. Working threads take work from the queue, and new tasks are scheduled as soon as their dependencies are satisfied. For example, Figure 1 illustrates a list of tasks generated from Cholesky-by-blocks. One thread will add all the task/dependencies in the right side of Figure 1 . One can easily notice that the Cholesky factorization on A 00 , A 11 , and A 22 can be performed together; after A 22 is updated, the triangular solve on A 23 and A 24 can be simultaneously executed before all antecedent tasks are finished. Threads will take these tasks from the queue as soon as they have no work and all dependencies on the given task are fulfilled. For more information on the operations and scheduling of the tasks please see [5] .
B. Basker: Data-parallel LU
The data-parallel incomplete LU factorization is called Basker [6] . Basker was a full factorization code based on a Gilbert and Peierls algorithm [25] designed for low fillin matrices, such as matrices from circuit simulations. The Gibert and Peierls algorithm finds the sparsity pattern onthe-fly during the factorization (See Algorithm 3). Note, due to our matrix choices, U = L T . A global array (marked) is maintained to keep fill counts, and nonzero location is found during factorization using multiple parallel depth-first searches. Algorithm 4 provides the depth-first search where L.f ill is a local variable and marked is shared between all threads.
Algorithm 3 Incomplete Gilbert-Peierls Algorithm
Find topological order of fill-in pattern of with depth-first search using shared array marked → pattern k 4: for all j ∈ pattern k do 5:
Copy from A(:, k) to L and U based on P 9: end for This is the first parallel implementation of Gilbert-Peierls. Data parallelism is achieved by dividing the matrix into a hierarchy of two-dimensional blocks of sparse submatrices. For general coefficient matrices, Basker uses a hierarchy of twodimensional submatrices generated by using a combination of block triangular form and nested dissection. However, for this set of matrices only nested dissection is used in order to match the ordering of Tacho. The resulting coefficient matrix from reordering with nested dissection is given in Figure 3 (a). The coloring provides one possible mapping to threads.
In order to execute in a data-parallel manner, sets of independent computations needed to be found, i.e., level-set. The tree in Figure 3 (b) provides level-sets that can be executed in parallel using data-parallelism in a bottom-up manner. A thread can compute all matrices in a node column-by-column independently from those in a node on the same level. Because submatrices in the same column, such as U 13 and U 23 , are in different nodes, multiple threads work on each column of the input matrix A to compute the corresponding column of L and U factors. Having multiple threads work on a shared column requires very fine grain interactions between threads. For example, thread 0 and thread 1 can work on U 13 and U 23 together respectfully. When working on LU 33 , thread 0 and thread 1 must synchronize to share their finding of U 13 and
where c is a column. These dependencies leading to synchronization points handled by using thread team level barriers in place of exiting the parallel_for. For more information on optimizing synchronizations, please see [6] .
C. Comparison on Key Differences
We now outline the key difference in these two codes. The main difference between these two codes is the use of task and data parallel methods, and is the main comparison point of this work. However, smaller differences exist that should be pointed out. The first of these is that Tacho is a right-looking algorithm and Basker is left-looking algorithm. Though this does not provide any change in the number of float-point operations, it does generate more tasks for Tacho to keep the tasking queue full. A left-looking algorithm may have more difficulty in generating numerous tasks with enough work. The second is that Tacho finds its nonzero pattern from an elimination tree before numeric factorization while Basker finds its nonzero pattern using depth-first search during numeric factorization. This difference requires threads in Basker to make a greater number of global read and write accesses to share the newly discovered sparsity pattern. However, this feature is numerically better for incomplete LU where we might wish to pivot or change fill in method when factorization starts to break down. Lastly, Tacho only needs to work on L while Basker works on both L and L T . Though the found L are the same, Basker does ∼ 2× the float-point operation. We adjust for this in our analysis. We note that all of these differences are critical choices one may make when coding incomplete sparse factorization and that all of these do not directly affect the ability to identify differences in performance.
IV. NUMERICAL EXPERIMENTS

A. Test Setup
We use a test suite of symmetric positive definite matrices shown in Table I [26] . These matrices where chosen, since they have a similar number of nonzeros in level 0 (L0), i.e., no fill-in allowed, and they have very different fill-in pattern for levels greater than 0. In Table I , we provide the size, number of nonzeroes in the original matrix |A|, and the number of nonzeros in L for fill level 0, 1, and 2. Note, special care was taken during ordering to allow both codes to have similar nonzero count for each fill level. This is done by both codes using nested-dissection ordering from Scotch with a fixed seed.
We focus only on level-based incomplete factorization, where the number of nonzero elements in the resulting L factors increase with the level of fill. Note that this also results in increased work to compute the factorization.
Since the incomplete Cholesky and LU codes use different approaches to determine the location of nonzero elements, we report timings in the following ways to offset the difference in the two algorithms. Tacho uses a symbolic factorization before numeric factorization to discover the nonzero entries due to fill-in using a parallel breadth-first search. Tacho computes Cholesky factors on the upper triangular half of the matrix while Basker computes LU factors on the whole matrix. Therefore, we define the time complexity (T ) in Tacho as:
where p is the number of threads and m is a matrix for a particular test bed system. In contrast, Basker has no independent symbolic factorization step. Basker discovers the nonzero entries to fill-in during numeric factorization by each thread tallying fill-in in a shared global array. We therefore define the time complexity as:
Essentially, T (T acho, p, m) and T (Basker, p, m)
theoretically could be the same if all the overhead costs, such as scheduling costs, cache behavior, and data block sizes, are comparable between the two codes.
B. Performance
We first examine the performance of our codes over the test suite on the three different systems. This test provides us with an understanding of how well each code performs, and after we analysis implementation choices that resulted in differences. We note that Tacho and Basker have similar performance on Sandy Bridge, with Basker being slightly better. All matrices have time close to each other except Thermal2 and Pwtk with level 2 fill which have ∼ 1.8× more nonzeros than the other two matrices with fill level 2. This uniform performance is a good sign as this type of system has been the long time common architecture of most HPC systems, even though that may not be true in the near future.
However, the performance of Tacho and Basker is very different on Xeon Phi and Power8. On Intel Xeon Phi, Tacho is both faster in serial and scales better than Basker. However, Thermal2 with fill level 2 is not separated from the other matrices like on Sandy Bridge. Despite having more nonzeros, the relative number of nonzeros per column/row is similar to the others. This is surprising, since it signifies that the time on the Xeon Phi has less to do with the number of float-point operations, but how they are accessed. Also, the performance of most matrices with Basker are very similar on Xeon Phi. Again, this means there is some overhead that is hiding the time variation related to float-point calculations. On Power8, Tacho and Basker have similar performance on a single NUMA region, i.e., 4 cores. However, the performance of Basker with its global access arrays and column based reductions, i.e., multiple threads need to synchronize per column, is highly affected by its NUMA effects. We hypothesize that these differences are a result of penalties from memory accesses and parallelization approaches. On the other hand, Tacho does not share a global look-up array and all tasks generated from Cholesky-by-blocks use cache-friendly sparse kernels with blocks, which is analogous to BLAS level 3 operations in dense linear algebra.
C. Analysis
In order to better understand the difference in performance between the two codes, we examine different algorithmic overheads of each of the two methods. To facilitate our understanding, we provide two simplified equations that represent impact of implementation choices and the overhead costs or other type of penalties they will introduce in our codes. First, Tacho would ideally only be adversely affected by overhead of using a tasking system (T askingOverhead), resulting in
where Operations is the number of float-point operations. However, Basker is adversely affected by both the number of synchronizations (Sync) to reduce on columns and number of global accesses to mark the sparsity pattern (GAccessOverhead). More concisely, (GAccessOverhead(p, m) ).
We do not try to fit a model by finding α i as these are only ideal simplified models and many other variables might slightly influence performance such as L2 hit ratio or eviction of cache lines. Tasking Overheads. First, we examine tasking overhead in Tacho. Tasking overhead involves the runtime cost related to task queuing (i.e., task generation, scheduling, context switching, and destruction) and the irregular data access pattern that results from dynamic task scheduling. Here, we define the relative task overhead in the numeric factorization as
T askOverhead(m) = T (T acho, 1, m) − T (T acho, s, m) T (T acho, 1, m) ×100
where T (T acho, 1, m) uses 1 thread on the task queue and T (T acho, s, m) is without task queues. Note that tasking overhead may vary slightly with number of threads due to more simultaneous accesses. However, this variation is so minimal that we were unable to observe it. Moreover, we observed in the performance subsection that most of this overheads is amortized as the number of threads increase.
We present these values in Figure 5 for all three systems, for four matrices each with four different levels of fill. Each bar represents the task overhead for a particular matrix and a fill level. Using the same ordering of a matrix, the number of tasks generated during the numeric factorization phase is constant and its tasking overhead is also assumed to be constant regardless of the fill levels. As the fill level increases, the workload in each task also increases; thus, the tasking overhead in the numeric factorization is reduced. In particular, pwtk is relatively denser than other test matrices, see the average number of nonzero columns per row (|A|/n) described in Table I . With an increased workload per task, the pwtk matrix has much smaller tasking overhead than other test matrices, which leads to higher scalability as shown in Figure 4 . The actual tasking overhead in parallel execution may vary across the systems as their runtime systems and memory hierarchies are different. Overall, the overhead can be overlapped between tasks and Tacho demonstrates portable parallel performance across different architectures with variation less than 10%. Synchronizations. Next we consider the number of synchronizations Basker has to perform. We do not provide the time for these synchronizations as they can be as short as only a few cycles and cannot be measured accurately. These synchronizations points are needed for multiple threads to work on the same column, and grow with the number of threads that work on a column. We will consider the maximum number of synchronizations that have to happen at each level of dependency tree (See Figure 3(b) ), where we number the tree starting at root=0. We can express the number of synchronization points as:
where max(col(k)) is the maximum number of columns in a submatrix U at dependency tree level k. Note that the number of syncs is highly dependent on the quality of node separators from the nested-dissection.
In Figure 6 , the maximal number of synchronization points for all matrices is given in solid lines. Additionally, reference functions (i.e., 1000(p − 2) and 1000lg(p − 2)) are given in dotted lines. We notice the number of synchronization points can be as bad as (p − 2)× in the beginning. However, this high growth rate has little impact on performance. One reason is that α S , which depends on global access cost, would be small in a socket. However, the growth rate decreases with threads, though α S would increase. For example, Xeon Phi has one of the most expensive access cost (See next analysis point) and G3 Circuit has one of the highest number of sync points. At 32 cores, the time flattens or even goes up. However, the variation in the number of sync points between matrices (∼ 2× on 2 and 32 threads) does not seem to have a detrimental effect on performance for both Sandy Bridge and Power 8 in a socket. Global Accesses. Next, we consider the impact of using a global access array to mark the sparsity pattern in Basker. We note this array is not the only data that is shared between threads (e.g., synchronization points), but is the only data structure that is accessed simultaneously by all threads. Therefore, high NUMA effects related to sharing memory and synchronization will not show up in a metric that considers overhead due to global access array. It will however provide insights on to the penalty for multiple threads using the same data structure that would come from invalidating cache lines and false sharing. Here, we define:
where T access (Basker, p, m) is the time for global accesses and was measured with independent timers for each thread. Figure 7 presents these values for fill level 2. We observe that Xeon Phi has the highest overhead cost and Sandy Bridge has the lowest. The overhead of accesses by Xeon Phi is about 10× higher than Sandy Bridge. Power8 has similar or lower cost on a single NUMA region. However, in a socket (8 cores), the cost is higher than Sandy Bridge using a shared L3 cache. This is multiplied even more when moving between the socket. We note that even though Power8 has less global access overhead it still has similar speedups (though faster time) as Xeon Phi with 16 cores. Therefore, we hypothesize other effects of NUMA on sharing memory must be impacting performance. However, the number synchronization points is far less than marking sparsity pattern. Therefore the synchronization cost, even on the problem with the most penalty on the Xeon Phi, could not account for more than 10% of execution time.
Since the number of global accesses increases with the number of nonzero entries in a row, the overhead depends on the sparsity pattern of a matrix and the fill level of factorization. This also implies that Basker has the opposite performance characteristics to Tacho for a particular matrix exposing relatively dense sparsity patterns. Note that the portion of tasking overhead decreases with an increasing number of nonzero entries. For instance, we can see the higher global access overhead for the pwtk matrix than other test problems especially on the Xeon Phi coprocessor as illustrated in Figure 7 . On the Power8 processor, the global access Fig. 7 . Global access overhead of Basker on incomplete factorization using fill level 2. We see that magnitude impact on Xeon Phi. overhead substantially increases when it requires to cross over the NUMA regions, and the higher memory access latency limits overall performance as shown in Figure 4 (f). Working Sets. Lastly, we consider how much shared memory can be used by multiple threads. This will relate to the number of hits in different levels of cache and will directly affect α O . This an important aspect as it attempts to capture a big difference between Tacho and Basker. For example, on the Xeon Phi, Basker is ∼ 9.8% slower than Tacho for G3 Circuit with level 2 fill. We know that global accesses only account for ∼ 5.1% of the execution time and synchronization points would account for < 2% based on the ratio fo sync points to pattern accesses. This means that about > 3% difference is explained by the remainder.
We do this by analyzing working sets [27] . The working set is the amount of data that an algorithm can keep for reuse, such as previously factorized submatrices that can be used for updating new submatrices. Therefore, we define:
where S i are submatrices in factored matrix and |S| is the number of submatrices. Figure 8 presents the working set size for Basker and Tacho. We note that Basker's working set size changes with the number of threads as block sizes of leaf level nodes vary. Unlike Basker, Tacho uses the same block size and number of tasks as the algorithm is independent of the number of 404 404 threads. Horizontal lines mark the working set size based on architecture and thread count, e.g., Power8(1) is Power8 with 1 core. Despite the Xeon Phi having shared L2 segments across all cores, we only report the local size because latency varies greatly depending on location. We observe that Sandy Bridge and Power8 have memory large enough to fit the working set size of Basker, but Xeon Phi does. Therefore, accesses may pay a high latency cost in shared L2 or have to go to main memory, and this may account for limited performance of Basker on Xeon Phi. On the other hand, Tacho generates fine-grained tasks of which data fits in the cache sizes of all test machines, providing better data locality. However, Power8 cache size may also be detrimental to the performance of Basker. In the Power8 system, accessing data in the cache of a core not in the same socket can be slower than accessing from DRAM. Since the cache is so large, data recently used may not be evicted resulting in accesses going to the cache of other processors.
D. Summary of Findings
Here we provide an overview of our findings from our analysis. We do this by first outlining our general finding for many-core system, and after finding related to sparse factorization. We find that task parallelism's runtime task scheduler can deliver a robust and portable performance on different architectures. However, the tasking overhead may hurt performance when task and thread counts are small on socket-based CPUs. Meanwhile, even a large number of synchronization points in Basker does not seem to harm performance. The choice to use a global accesses is not ideal for systems that suffer from high latency memory access between cache in different units. Lastly, the ability for computation with reuse to fit into cache is critical. Independent sets of computation based on partition of data, i.e., data parallelism, can be farther broken down beyond the number of threads. However, this may result in more synchronization or viewing multiple tasks in a computation.
These finding are critical to understanding performance of sparse factorizations. First, we noticed that either task or data parallelism are relatively fine for Intel Sandy Bridge type systems that currently dominate HPC. On many-core systems, task parallelism allow for easy implementation of sizes that can fit into the complex cache system. This means that matrices with a high number of nonzeros are better for task parallel codes. However, very sparse matrices like ecology2 and G3 Circuit with level 0 fill are better on the Sandy Bridge and Power8 systems with data-parallelism. We note in Figure 8 that in both these cases the workset fit into cache, which means that overheads for data parallelism is much lower than tasking overhead and that for very sparse matrices one should not use tasking. Second, the use global access can be expensive, e.g., as high as 10% on Xeon Phi. However, the use of a global accesses for sparsity pattern can allow the factorization flexibility. For example, the code can switch fillin criteria if something ill-posed happens in the numeric phase, such as diagonal becoming too small. If we know matrices are well posed, the sparsity pattern should be found before hand. However this optimization is not critical in socket based systems, since the overhead is on the order of system noise, i.e., < 4%.
V. RELATED WORK
Both factorization codes used techniques from Hyson and Pothen [28] , [29] . These studies used MPI to test incomplete factorization in a distributed memory environment. In these environments, using a SGI Origin2000, no real measurable speedups where seen due to machine noise. However, they prove to be sound in theory when applied to shared memory environments.
Several other studies have looked at task parallel incomplete sparse factorization and analyzed their performance. Aliaga et al. [30] demonstrates the incomplete factorization ILUPack that uses the task scheduler OmpSs for precondition Conjugate Gradients (CG). This factorization uses a data-flow model to generate 1D tasks unlike algorithm-by-block. However, dataflow has several down falls compared to Tacho (See Kim et al. [5] for more details). They report an average or lower speedup on x86 CPUs, and do not explore many-core systems. Moreover, they do not compare tasking overhead to overheads in data parallel methods, or the impact of tasking overhead in relationship to fill. Related to tasking runtime, many studies have been conducted for dense linear algebra. These include QUARK [31] , SuperMatix [23] , and DAGuE [32] . However, none of these have compared how sparse matrices would benefit from their runtime.
Lastly, Chow and Patel [33] have demonstrated a very fine-grained parallel incomplete factorization, which can be executed naively in a data-parallel manner. They do this by first finding the sparsity pattern. After, they solve in "sweeps" for each nonzero as a system on nonlinear equations. As a result their method is an approximation, but they have shown that the number of sweeps can be low in order to provide an efficient preconditioner.
VI. CONCLUSION
Many implementation choices exist for sparse matrix algorithms. High-level programing models, such as Kokkos, provide a framework to write performance portable code on multi/many-core system. However, performance portable does not translate to proportional performance between systems. Many other variables must be taken into account, such as memory latencies and cache sizes. We demonstrate that task and data parallelism are viable options for sparse incomplete factorization on x86 style CPUs, but that additional consideration needs to be taken with other types of many-core systems. In general, we find that task-parallel sparse incomplete factorization is better on very sparse matrices for systems that have some shared low latency cache system. When dealing with high fill-in and large amounts of threads, the tasking overhead can be amortized. Additionally, global accesses in these system to find fill-in dynamically is not detrimental to performance. However, systems lacking a deep cache structure, such as Xeon Phi, are not ideal for data-parallel sparse incomplete factorization, since even the smallest problems will not fit into cache.
