The emergence and continuing use of multi-core architectures require changes in the existing software and sometimes even a redesign of the established algorithms in order to take advantage of now prevailing parallelism. The Parallel Linear Algebra for Scalable Multi-core Architectures (PLASMA) is a project that aims to achieve both high performance and portability across a wide range of multi-core architectures. We present in this paper a comparative study of PLASMA's performance against established linear algebra packages (LAPACK and ScaLAPACK), against new approaches at parallel execution (Task Based Linear Algebra Subroutines -TBLAS), and against equivalent commercial software offerings (MKL, ESSL and PESSL). Our experiments were conducted on one-sided linear algebra factorizations (LU, QR and Cholesky) and used multi-core architectures (based on Intel Xeon EMT64 and IBM Power6). A performance improvement of 67% was for instance obtained on the Cholesky factorization of a matrix of order 4000, using 32 cores.
INTRODUCTION AND MOTIVATIONS
The emergence of multi-core microprocessor designs marked the beginning of a forced march toward an era of computing in which research applications must be able to exploit parallelism at a continuing pace and unprecedented scale [1] . This confronts the scientific software community with both a * Research reported here was partially supported by the National Science Foundation and Microsoft Research † Computer Science and Mathematics Division, Oak Ridge National Laboratory, USA ‡ School of Mathematics & School of Computer Science, University of Manchester, United Kingdom Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SuperComputing '09 Portland, Orgeon USA Copyright 200X ACM X-XXXXX-XX-X/XX/XX ...$10.00. daunting challenge and a unique opportunity. The challenge arises from the disturbing mismatch between the design of systems based on this new chip architecture -hundreds of thousands of nodes, a million or more cores, reduced bandwidth and memory available to cores -and the components of the traditional software stack, such as numerical libraries, on which scientific applications have relied for their accuracy and performance. So long as library developers could depend on ever increasing clock speeds and instruction level parallelism, they could also settle for incremental improvements in the scalability of their algorithms. But to deliver on the promise of tomorrow's petascale systems, library designers must find methods and algorithms that can effectively exploit levels of parallelism that are orders of magnitude greater than most of today's systems offer. This is an unsolved problem.
To answer this challenge, we have developed a project called Parallel Linear Algebra Software for Multi-core Architectures (PLASMA) [2, 3] . PLASMA is a redesign of LAPACK [4] and ScaLAPACK [5] for shared memory computers based on multi-core processor architectures. It addresses the critical and highly disruptive situation that is facing the linear algebra and high performance computing (HPC) community due to the introduction of multi-core architectures. To achieve high performance on this type of architecture, PLASMA relies on tile algorithms, which provide fine granularity parallelism. The standard linear algebra algorithms can then be represented as a Directed Acyclic Graph (DAG) [6] where nodes represent tasks, either panel factorization or update of a block-column, and edges represent dependencies among them. Moreover, the development of programming models that enforce asynchronous, out of order scheduling of operations is the concept used as the basis for the definition of a scalable yet highly efficient software framework for computational linear algebra applications. In PLASMA, parallelism is no longer hidden inside Basic Linear Algebra Subprograms (BLAS) [7] but is brought to the fore to yield much better performance. Each of the one-sided tile factorizations presents unique challenges to parallel programming. Cholesky factorization is represented by a DAG with relatively little work required on the critical path. LU and QR factorizations have exactly the same dependency pattern between the nodes of the DAG. These two factorizations exhibit much more severe scheduling and numerical (only for LU) constraints than the Cholesky factorization. PLASMA is currently scheduled statically with a trade off between load balancing and data reuse.
As multicore systems require finer granularity and higher asynchronicity, considerable advantages may be obtained by reformulating old algorithms or developing new algorithms in a way that their implementation can be easily mapped on these new architectures. For instance, block operations in the standard LAPACK algorithms for some common factorizations need to be broken into sequences of smaller tasks or tiles in order to achieve finer granularity and higher flexibility in the scheduling of tasks to cores. Besides providing fine granularity, the use of tile algorithms also makes it possible to use more efficient storage format for the data such as Block Data Layout (BDL). Matrix elements are now stored contiguously in memory within one tile which engender a small stride access of memory and therefore, a low cache miss rate.
The development of these new algorithms for multi-core systems on one-sided factorizations motivated us to show improvements made by the tile algorithms in performing a comparative study of PLASMA against established linear algebra packages (LAPACK and ScaLAPACK) as well as equivalent commercial software offerings (MKL, ESSL and PESSL). Moreover, we also compared a new approach at parallel execution called TBLAS [8] (Task Based Linear Algebra Subroutines) in the Cholesky and QR cases -the LU factorization not being implemented yet. The experiments were conducted on two different multi-core architectures based on Intel Xeon EMT64 and IBM Power6, respectively.
The paper is organized as follows. Section 2 provides an overview of the libraries and the hardware used in the experiments, as well as a guideline to interpret the results. Section 3 describes the tuning process to achieve the best performance of each software. In section 4 we present a comparative performance evaluation of the different libraries before concluding and presenting future work directions in Section 5.
EXPERIMENTAL ENVIRONMENT
We provide below an overview of the software used for the comparison against PLASMA as well as a description of the hardware platforms, i.e., Intel Xeon EMT64 and IBM Power6 machines. We also introduce some performance metrics that will be useful to the interpretation of the experimental results.
Libraries

LAPACK, MKL and ESSL
LAPACK 3.2 [4] , an acronym for Linear Algebra PACKage, is a library of Fortran subroutines for solving the most commonly occurring problems in numerical linear algebra for high-performance computers. It has been designed to achieve high efficiency on a wide range of modern highperformance computers such as vector processors and shared memory multiprocessors. LAPACK supersedes LINPACK [9] and EISPACK [10] . The major improvement was to rely on the use of a standard set of Basic Linear Algebra Subprograms (BLAS) [7] , which can be optimized for each computing environment. LAPACK is designed on top of the level 1, 2, and 3 BLAS, and nearly all of the parallelism in the LAPACK routines is contained in the BLAS. MKL 10.1 (Math Kernel Library) [11] , a commercial offering from Intel, is a library of highly optimized, extensively threaded math routines that require maximum performance such as BLAS. We consider in this paper the 10.1 versionthe latest available version when the paper was submitted.
Library) [12] is a collection of subroutines providing a wide range of highly tuned mathematical functions specifically designed to improve the performance of scientific applications on the IBM Power processor-based servers (and other IBM machines).
ScaLAPACK and PESSL
ScaLAPACK 1.8.0 (Scalable Linear Algebra PACKage) [5] is a parallel implementation of LAPACK for parallel architectures such as massively parallel SIMD machines, or distributed memory machines. It is based on the Message Passing Interface (MPI) [13] . As in LAPACK, ScaLAPACK routines rely on block-partitioned algorithms in order to minimize the frequency of data movement between different levels of the memory hierarchy. The fundamental building blocks of the ScaLAPACK library are distributed memory versions of the Level 1, Level 2, Level 3 BLAS, called the Parallel BLAS or PBLAS [14] , and a set of Basic Linear Algebra Communication Subprograms (BLACS) [15] for communication tasks that arise frequently in parallel linear algebra computations. In the ScaLAPACK routines, the majority of interprocessor communication occurs within the PBLAS and the interface is as similar to the BLAS as possible. The source code of the top software layer of ScaLA-PACK looks very similar to that of LAPACK. Although ScaLAPACK and PESSL are primarily designed for distributed memory architectures, a driver optimized for shared-memory machines allows memory copies instead of actual communications.
TBLAS
The TBLAS [8] project is an on-going effort to create a scalable, high-performance task-based linear algebra library that works in both shared memory and distributed memory environments. TBLAS takes inspiration from SMP Superscalar (SMPSs) [16] , PLASMA and ScaLAPACK. It adopts a dynamic task scheduling approach to execute dense linear algebra algorithms on multi-core systems. A task-based library replaces the existing linear algebra subroutines such as PBLAS. TBLAS transparently provides the same interface and functionality as ScaLAPACK. Its runtime system extracts a DAG of tasks from the sequence of function calls of a task-based linear algebra program. Data dependencies between tasks are identified dynamically. Their execution is driven by data availability. To handle large DAGs, a fixedsize task window is enforced to unroll a portion of the active DAG on demand. Hints regarding critical paths (e.g., panel tasks for factorizations) may also be provided to increase the priority of the corresponding tasks. Although the TBLAS runtime system transparently handles any data movement required for distributed memory executions, we will not use this feature since we focus on shared-memory multi-core architectures. So far the TBLAS project is in an experimental stage and the LU factorization (DGETRF routine) is not completed yet.
PLASMA, TBLAS, LAPACK and ScaLAPACK are all linked with the optimized vendor BLAS available on the system, which are MKL 10.1 and ESSL 4.3 on the Intel64 and Power6 architectures, respectively.
Hardware
Intel64-based 16 cores machine
Our first architecture is a quad-socket quad-core machine based on an Intel Xeon EMT64 E7340 processor operating at 2.39 GHz. The theoretical peak is equal to 9.6 Gflop/s/ per core or 153.2 Gflop/s for the whole node, composed of 16 cores. There are two levels of cache. The level-1 cache, local to the core, is divided into 32 kB of instruction cache and 32 kB of data cache. Each quad-core processor being actually composed of two dual-core Core2 architectures, the level-2 cache has 2×4 MB per socket (each dual-core shares 4 MB). The effective bus speed is 1066 MHz per socket leading to a bandwidth of 8.5 GB/s (per socket). The machine is running Linux 2.6.25 and provides Intel Compilers 11.0 together with the MKL 10.1 vendor library.
Power6-based 32 cores machine
The second architecture is a SMP node composed of 16 dual-core Power6 processors. Each dual-core Power6 processor runs at 4.7 GHz, leading to a theoretical peak of 18.8 Gflop/s per core and 601.6 Gflop/s per node. There are three levels of cache. The level-1 cache, local to the core, can contain 64 kB of data and 64 kB of instructions; the level-2 cache is composed of 4 MB per core, accessible by the other core; and the level-3 cache is composed of 32 MB common to both cores of a processor with one controller per core (80 GB/s). An amount of 256 GB of memory is available and the memory bus (75 GB/s) is shared by the 32 cores of the node. The machine runs AIX 5.3 and provides the xlf 12.1 and xlc 10.1 compilers together with the ESSL 4.3 vendor library. Actually, the whole machine is the IBM Vargas 1 -ranked 50th on the TOP500 of November 2008 [17] -from IDRIS 2 and is composed of 112 SMP nodes for a total 67.38 Tflops/s theoretical peak. However, we use only a single node at a time since we focus on shared-memory, multi-core architectures.
Performance metrics (How to interpret the graphs)
In this section, we present some performance references and definitions that we will consistently report in the paper to evaluate the efficiency of the libraries presented above.
We report three performance references. The first one is the theoretical peak of the hardware used. All the graphs in the paper are scaled to that value, which depends on the processor used (Intel64 or Power6) and the number of cores involved (or the maximum number of cores if results on different number of cores are presented in the same graph). For instance, the graphs of Figures 1(a) and 1(b) are scaled to 153.2 Gflop/s and 601.6 Gflop/s since they respectively involve a maximum of 16 Intel64 cores and 32 Power6 cores. The second reference is the parallel vendor DGEMM performance (i.e., the MKL and ESSL ones on Intel64 and Power6, respectively). DGEMM is a commonly used performance reference for parallel linear algebra algorithms; achieving a comparable performance is challenging since it is a parallelfriendly level-3 BLAS operation. Figures 1(a) and 1(b) illustrate very good scalability on both platforms.
The last reference we consider is related to the peak performance of the serial computational intensive level-3 BLAS kernel effectively used by PLASMA. The kernel used depends on the factorization; they are respectively dgemm-seq, dssrfb-seq and dssssm-seq for the Cholesky (DPOTRF), QR (DGEQRF) and LU (DGETRF) algorithms. The dssrfb-seq and dssssm-seq routines are new kernels (not part of BLAS nor LAPACK standard) that have been specifically designed for tile algorithms. They involve extra-flops (that can lead to a total 50% overhead if no inner blocking occurs) [2] .
Their complete description is out of the scope of this paper and more information can be found in [2] . The performance of a PLASMA thread (running on a single core) is bounded by the performance -that we note kernelseq -of the corresponding serial level-3 kernel. Therefore, the performance of an execution of PLASMA on p cores is bounded by p × kernelseq . For instance, Figures 2(a), 2(b) and 2(c) respectively show the performance upper bound for the DPOTRF, DGEQRF and DGETRF (PLASMA) factorizations; they correspond to the horizontal 16×dgemm-seq, 16×dssrfb-seq and 16×dssssm-seq dashed lines.
The speedup of a parallel execution on p cores is defined by the following formula:
where p is the number of cores, t1 the execution time of the sequential algorithm and tp the execution time of the parallel algorithm when p cores are used. A linear speedup is obtained when sp = p. This corresponds to the ideal case where doubling the number of cores doubles the speed.
Another performance metric commonly used for parallel performance evaluation is the efficiency. It is usually defined as follows: Ep = t 1 p×tp . However, this metric does not provide much information about the relative performance of a library against another one, which is the focus of this paper. Therefore we introduce the normalized efficiency as ep = t kernelseq p×tp where t kernelseq is the time that the serial level-3 kernel (dgemm-seq, dssrfb-seq or dssssm-seq) would spend to perform an equivalent number of effective floating point operations. We say it is normalized since the sequential reference time t kernelseq consistently refers to a common operation independently of the considered library (but that still depends on the considered factorization). Therefore, that metric directly allows us to compare two libraries to each other. Furthermore, PLASMA normalized efficiency is bounded by a value of 1 (ep(P LASM A) ≤ 1) since the level-3 serial kernels are the tasks that achieve the highest performance during the PLASMA factorization process. That normalized efficiency can also be interpreted geometrically. It corresponds to the performance ratio of the considered operation to the p × kernelseq reference. For instance, in Figure 2 being equal to 129.6 Gflops, the peak normalized efficiency is thus equal to ep = 0.79.
TUNING
Maximizing the performance and minimizing the execution time of scientific applications is a daily challenge for the HPC community. The libraries presented in the previous section have tunable execution parameters and a wrong setting for a single parameter can dramatically slow down the whole application. This section presents an overview of the tuning method applied to each library.
PLASMA
PLASMA performance strongly depends on tunable execution parameters trading off utilization of different system resources, the outer and the inner blocking sizes as illustrated in Figure 2 . The outer block size (NB) trades off parallelization granularity and scheduling flexibility with single core utilization, while the inner block size (IB) trades off memory load with extra-flops due to redundant calculations. Only the QR and LU factorizations use inner blocking. If no inner blocking occurs, the resulting extra-flops overhead may represent 25% and 50% for the QR and LU factorization, respectively [2] . Tuning PLASMA consists of finding the (NB,IB) pairs that maximize the performance depending on the matrix size and on the number of cores. An exhaustive search is cumbersome since the search space is huge. For instance, in the QR and LU cases, there are 1352 possible combinations for (NB,IB) even if we constrain NB to be an even integer between 40 and 500 and if we constrain IB to divide NB. All these combinations cannot be explored for large matrices (N >> 1000) on effective factorizations in a reasonable time. Knowing that this process should be repeated for each number of cores and each matrix size motivates us to consider a pruned search. The idea is that tuning the serial level-3 kernel (dgemm-seq, dssrfb-seq and dssssm-seq) is not time-consuming since peak performance is reached on relatively small input matrices (N B < 500) that can be processed fast. Therefore, we first tune those serial kernels. As illustrated in Figure 3 , not all the (NB,IB) pairs result in a high performance. For instance, the (480,6) pair leads to a performance of 6.0 Gflop/s whereas the (480,96) pair achieves 12.2 Gflop/s, for the dssrfb-seq kernel on Power6 (Figure 3(e) ). We select a limited number of (NB,IB) pairs (pruning step) that achieve a local maximum performance on the range of NB. We have selected five or six pairs on the Intel64 machine for each factorization and eight on the Power6 machine ( Figure 3 ). We then benchmark the performance of PLASMA factorizations only with this limited number of combinations (as seen in Figure 2) . Finally, the best performance obtained is selected.
The dssssm-seq efficiency depends on the amount of pivoting performed. The average amount of pivoting effectively performed during a factorization is matrix-dependent. Because the test matrices used for our LU benchmark are randomly generated with a uniform distribution, the amount of pivoting is likely to be important. Therefore, we have selected the (NB,IB) pairs from dssssm-seq executions with full pivoting (figures 3(c) and 3(f)). The dssssm-seq performance drop due to pivoting can reach more than 2 Gflop/s on Power6 (Figure 3(f) ).
We have validated our pruned search methodology for the three one-sided factorizations on Intel64 16 cores. To do so, we have measured the relative performance overhead (percentage) of the pruned search (PS) over the exhaustive search (ES), that is: 100 × ( Table 1 shows that the pruned search performance overhead is bounded by 2%. Because the performance may slightly vary from one run to another on cache-based architectures [18] , we could furthermore observe in some cases higher performance (up to 0.95%) with pruned search (negative overheads in Table 1 ). However, the (NB,IB) pair that leads to the highest performance obtained with one method consistently matches the pair leading to the highest performance obtained with the other method. These results validate our approach. Therefore, in the following, all the experimental results presented for PLASMA will correspond to experiments conducted with a pruned search methodology.
TBLAS
Like PLASMA, TBLAS performance relies on two tunable parameters, NB and IB. However, preliminary experimental results led the authors to systematically set IB equal to 20% of NB (for the QR factorization). In its current development stage, TBLAS requires that NB divides the matrix size N; this constraint reduces the search space and makes it possible to perform an exhaustive search. The TBLAS results reported in Section 4 will thus correspond to experiments conducted with an exhaustive search.
ScaLAPACK and PESSL
ScaLAPACK and PESSL are based on block-partitioned algorithms and the matrix is distributed among a processor grid. The performance thus depends on the the topology of the processor grid and on the block size (NB). For instance, when using 32 cores, the grid topology can be 8x4 (8 rows, 4 Because there is a limited number of possible combinations, in the following we will report performance results obtained with an exhaustive search: for each matrix size and number of cores, all the possible processor grid distributions are tried in combination with the four block sizes (N B ∈ {32, 64, 128, 256}). And the highest performance is finally reported.
LAPACK,MKL andESSL
In LAPACK, the blocking factor (NB) is hard coded in the auxiliary routine ILAENV [19] . This parameter has been set to 64 for both LU and Cholesky factorizations and to 32 for QR factorization 3 . MKL andESSL have been highly optimized by their respective vendors.
COMPARISON TO OTHER LIBRARIES
Methodology
Factorizations are performed in double precision. We briefly recall the tuning techniques used in the experiments discussed below according to Section 3. We employ a pruned search for PLASMA; an exhaustive search for TBLAS, ScaLA-PACK and PESSL; and we consider that LAPACK, MKL and ESSL have been tuned by the vendor.
Furthermore, to capture the best possible behavior of each library, we repeat the number of executions (up to 10 times) and we report the highest performance obtained. We do not flush the caches [20] before timing a factorization 4 . However, the TLB (Translation Lookaside Buffer) is flushed between two executions: the loop over the different executions is performed in a script (rather than within the executable) and calls several times the same executable.
ScaLAPACK, PESSL and PLASMA interfaces allow the user to provide data distributed on the cores. In our sharedmemory multi-core environment, because we do not flush the caches, these libraries have thus the advantage to start the factorization with part of the data distributed on the caches. This is not negligible. For instance, a 8000 × 8000 double precision matrix can be held distributed on the L3 caches of the 32 cores of a Power6 node.
Experiments on few cores
We present in this section results of experiments conducted on a single socket (4 cores on Intel64 and 2 cores on Power6). In these conditions, data reuse between cores is likely to be efficient since the cores share different levels of cache (see Section 2.2). Therefore, we may expect a high speedup (see definition in Section 2. Even LAPACK has a linear speedup. In the DPOTRF case, most of the computation is spent by the dgemm-seq kernel. Since LAPACK exploits parallelism at the BLAS level, the parallel DGEMM version is actually executed. However, since those tasks are processed in sequence, this approach is very synchronous and does not scale with the number of cores. Indeed, Figure 9 (a) shows that LAPACK DPOTRF does not benefit from any significant performance improvement anymore when more than 4 cores (i.e. more than one socket) are (is) involved. On Power6, the LA-PACK DPOTRF performance also stabilizes at 20.0 Gflop/s (on 2 cores this time, see Figure 5 (d)) which corresponds to an even higher normalized efficiency e2 = 0.69. The very high bandwidth (80 MB/s per core) of the L3 shared cache strongly limits the overhead due to this synchronous approach. Furthermore, the high memory bandwidth (75 GB/s shared bus) allows a non negligible improvement of the LAPACK DPOTRF when going to more than 2 coresand up to 16 -as illustrated by Figure 9(b) . However, the performance hardly exceeds 100 Gflop/s, which is only one sixth of the theoretical peak of the 32 cores node. Similar observations can be reported from Figures 5 and 9 for the LAPACK DGEQRF and DGETRF routines (except that a non negligible performance improvement can be obtained up (l) DGETRF -Power6 -32 cores Figure 4 : Effect of NB on TBLAS (a,b,c,e,f,g) ScaLAPACK (i,j,k) and PESSL (l,m,n) performance (Gflop/s). to 8 cores in the LAPACK DGETRF case on Intel -see Figure 9(a) ). In a nutshell, parallelism at the BLAS level can achieve a decent performance when the cores involved are limited to a single socket.
Figures 5(c) and 5(f) show that the vendor libraries (MKL and ESSL) outperform the other libraries -including PLAS-MA-when the LU factorization (DGETRF routine) is executed on a single socket. They even outperform the upper bound of the tile algorithms. Indeed, geometrically, their graphs are above the 4×dssssm-seq (resp. 2×dssssm-seq) dashed lines and, numerically, their peak normalized efficiency e4 = 1.13 (resp. e2 = 1.29) is higher than 1. This result shows that when the hardware allows a very efficient data reuse across the cores -thanks to the high bandwidth of shared level of caches -the lower performance of the kernels used in the tile algorithms cannot be balanced by the better scheduling opportunities that these algorithms provide. The lower performance of tile algorithms kernels is both due to a current non optimized implementation (that we plan to improve) and to the extra-flops performed (that may be decreased at the price of a lower data reuse). The dssssm-seq kernel is the one involving the largest amount of extra-flops for going to tile algorithms (that can lead to a total 50% overhead if no inner blocking occurs). When the extra-flops count is lower, tile algorithms remain competitive. Figures 5(b) and 5(e) illustrate this phenomenon for the QR factorization (DGEQRF routine) whose extra-flops count is bounded by 25% (and less if inner blocking occurs). In this case, tile algorithms (PLASMA and TBLAS) can almost compete against vendor libraries; their better scheduling opportunities enable balancing the extra-flops count (as well as the current non optimized implementation of their serial kernels) although the hardware efficiency for parallel data reuse limits the impact of a non optimum scheduling. The Cholesky factorization (DPOTRF routine) does not involve extra-flops for going to tile algorithms and directly relies on a kernel optimized by the vendor (dgemm-seq). As illustrated in figures 5(a) and 5(d), in this case, tile algorithms (and in particular PLASMA) outperform the algorithms implemented in the other libraries, including the vendor libraries.
In a nutshell, the overhead of imperfect scheduling strategies is limited when the algorithms are executed on few cores that share levels of cache they can access with a very high bandwidth. In this case, tile algorithms have a higher performance than established linear algebra packages depending on whether the extra-flops count for going to tile and the overhead of a non optimized serial kernel is not too large compared to the limited practical improvement gained by their scheduling opportunities. The advantage of tile algorithms relies on their very good scalability achieved by their scheduling opportunities; Figure 7 shows their excellent practical scalability on both machines in the PLASMA case for instance. In the next section, we discuss their relative performance against other libraries when a larger number of cores are used.
Large number of cores
We present in this section results of experiments conducted on a larger number of cores (16 cores on Intel64; 16 and 32 cores on Power6). In these conditions, the cost of data reuse between cores is higher when the cores do not share a common level of cache. For instance, on the Power6 architecture, if a core accesses data that have been processed by another processor, it will access that data through the memory bus, which is shared by all the 32 cores. Even if that bus has a high bandwidth (75 GB/s), the actual throughput is limited when many cores simultaneously access the memory. Therefore, we expect a higher sensitivity of the algorithms to the scheduling strategy than we indicated in Section 4.2. Figures 6(a), 6 (d) and 6(g) present the Cholesky (DPO-TRF routine) performance of the different libraries. PLAS-MA consistently outperforms the other libraries, followed by TBLAS. These results illustrate the performance improvement brought by tile algorithms. For instance, a speedup of 67% was obtained for a Cholesky factorization on a matrix of order 4000 on 32 cores against vendor libraries and ScaLAPACK. The higher efficiency of PLASMA compared to TBLAS is essentially due to a better data reuse. Indeed, PLASMA scheduling strategy maximizes data reuse and thus benefits from a better cache effect than TBLAS whose scheduler does not take into account date reuse. PLAS-MA is even faster than the parallel DGEMM reference up to a matrix size N = 2000 on the Intel64 machine when 16 cores are used. This is not contradictory since PLAS-MA does not rely on the parallel version of DGEMM. Each PLASMA thread indeed uses the serial dgemm-seq. Better performance can then be achieved thanks to better scheduling. However, the DPOTRF factorization cannot only be performed with efficient level-3 BLAS operations (it also involves level-2 BLAS operations). Therefore, as expected, the parallel DGEMM dominates all the other operations when the matrix size becomes larger. This illustrates the fact that using a fine enough granularity (as in tile algorithms) is more critical when processing small or moderate size matrices. Indeed, the better load balancing allowed by a fine granularity does not impact the steady state of the DAG processing. This also explains the major improvement brought by PLASMA compared to the other libraries on matrices of small and moderate size.
Still considering the DPOTRF routine, the vendor libraries (MKL and ESSL) remain the most competitive solution versus tile algorithms (but they have a lower efficiency than tile algorithms) compared to ScaLAPACK and LAPACK approaches, up to 16 cores. However, ESSL does not scale well to 32 cores (figures 8(j), 8(k), 8(l)). ScaLAPACK and its vendor equivalent, PESSL, both outperform ESSL on 32 cores (Figure 6(g) ). This result generalizes, to some extent, to all the factorization routines. better performance than PLASMA when 16 cores are used and when the matrix size is larger than or equal to 10,000 (N ≥ 10, 000). Indeed, when the matrices processed are large, the critical issue of scheduling corresponds to maximizing a steady state throughput. The main disadvantage of a static schedule is that cores may be stalling in situations where work is available. This throughput is easier to maximize with a dynamic scheduling strategy. Approaches such as TBLAS, which do implement a dynamic scheduling, are thus likely to achieve a higher performance than approaches that implement a static scheduling (such as PLAS-MA currently does). All in all, these results are motivation to move towards an hybrid scheduling strategy that would assign priorities according to a trade off between data reuse and critical path progress and would process available tasks dynamically. Figure 6 (e) illustrates the performance of the QR factorization (DGEQRF routine) when only half of the available cores are used on Power6. In this case, PLASMA still achieves the highest performance but TBLAS and ESSL almost reach a similar performance.
Finally, figures 6(c), 6(f) and 6(i) show that the LU factorization (DGETRF routine) has a performance behavior similar to the QR factorization and PLASMA again outperforms the other libraries. However, the lower efficiency of dssssm-seq compared to dssrfb-seq (dssssm-seq performs more extra-flops) induces a lower performance of the PLAS-MA LU factorization compared to the PLASMA QR one. On Intel64, this leads MKL to be slightly better than PLAS-MA when the matrix size is larger than or equal to 10,000 (N ≥ 10, 000). But, similarly to the QR case, moving towards a hybrid scheduling should remove the penalty due to the static scheduling strategy used in PLASMA and strongly improve the performance on large matrices. Indeed, on Intel64 when 16 cores are used, the PLASMA LU peak normalized efficiency (see definition in Section 2.3) is equal to 78.6%; so there is still room for improving the performance by enabling dynamic scheduling. Furthermore, as already mentioned, an optimized implementation of the dssssm-seq kernel will also improve the performance of tile algorithms.
CONCLUSION AND PERSPECTIVES
This paper has analyzed and discussed the behavior of several software approaches for achieving high performance and portability of one-sided factorizations on multi-core architectures. In particular, we have shown the performance improvements brought by tile algorithms on up to 32 cores -the largest shared memory multi-core system we could access. We may expect that these results generalize somewhat to other linear algebra algorithms and even any algorithm that can be expressed by a DAG of fine-grain tasks. Preliminary experiments using tile algorithms for two-sided transformations, i.e., the Hessenberg reduction [21] (first step for the standard eigenvalue problem) and the bidiagonal reduction [22] (first step for the singular value decomposition), show promising results.
We have shown the efficiency of our pruned-search methodology for tuning PLASMA. However, we currently need to manually pick up the (NB,IB) samples (from the serial level-3 kernel benchmarking) that are going to be tested in the parallel factorizations. We are currently working to automate this pruning process. Furthermore, not all the matrix sizes and number of cores can be tested. We are also working on the interpolation of the optimum tuning parameters from a limited number of parallel executions among the range of cores and matrix sizes to the full set of possibilities. This ongoing auto-tuning work should eventually be incorporated within the PLASMA distribution.
Furthermore, because the factorization performance strongly depends on the computational intensive serial level-3 kernels, their optimization is paramount. Unlike DGEMM, the dssrfb-seq and dssssm-seq kernels are not a single call to level-3 BLAS operations, but are composed of successive calls, since the inefficiency. dssrfb-seq and dssssm-seq could achieve similar performance if implemented as a monolithic code and heavily optimized.
The experiments have also shown the limits of static scheduling for the factorization of large matrices. We are currently working on the implementation of a hybrid scheduling for PLASMA. Even if they are not on the critical path, tasks will be dynamically scheduled on idle cores so as to maximize data reuse. 
