Abstract-LU factorization is one of the most widely-used methods for solving linear equations, and thus its performance underlies a broad range of scientific computing. As architectural trends have replaced clock rate improvements with increases in parallel scale, library writers have responded by using tiled algorithms, where operand size is constrained in order to maximize parallelism, as seen in the well-known PLASMA library. This approach has two main drawbacks: (1) asymptotic performance is reduced due to limited operand size; (2) performance of small to medium sized problems is reduced due to unnecessary data motion in the parallel caches. In this paper we introduce a new approach where asymptotic performance is maximized by using special lowoverhead kernel primitives that are auto-generated by the ATLAS framework, while unnecessary cache motion is minimized by using explicit cache management. We show that this technique can outperform all known libraries at all problem sizes on commodity parallel Intel and AMD platforms, with asymptotic LU performance of roughly 91% of hardware theoretical peak for a 12-core Intel Xeon, and 87% for a 32-core AMD Opteron.
I. INTRODUCTION AND RELATED WORK
LU factorization (essentially Gaussian elimination with partial pivoting) is a critical component for scientific computing. This functionality is exposed to the user via the LA-PACK [1] , [2] (Linear Algebra PACKage) API. In this paper we consider the double precision real version of this routine, and will refer to it using its LAPACK name DGETRF (Double precision GEneral TRiangular Factorization). The main computational component of DGETRF is matrix multiply, which is provided by the BLAS [3] API under the name DGEMM (Double precision GEneral rectangular Matrixmatrix Multiply). DGEMM is typically the most efficient (in both parallel and serial performance) of the provided BLAS operations. The second-most important BLAS call for DGETRF is DTRSM (Double precision TRiangular Solve to a Matrix of right-hand sides). 
A. Experimental Details
Timings presented in this paper were run on Debian 6 (Linuxv2.6.32-5-amd64), gcc 4.7.0, with ACMLv5.3.1, ATLASv3.11.14, FLAMEr11400, LAPACKv3.4.2, MKLv10.3.9, PLASMAv2.5.1. All timings are for LU with LAPACK-equivalent partial pivoting, using the default block sizes provided by the libraries. PLASMA offers both static and dynamic scheduling options; we report static timings, since static scheduling gave dramatically better small-case peformance with essentially the same asymptotic performance. Since parallel timings are volatile (particularly for smaller sizes), we always report the average across a number of timing runs. All our timings are on square matrices of order N . For N ≤ 2000, we average fifty timings, for 2000 < N ≤ 10, 000 six timings, for 10, 000 < N ≤ 20, 000 three timings, and for problems of greater size we average only two timings.
Timings were performed on two commodity sharedmemory platforms: (1) O32: 32-core, 2.0 Ghz AMD Opteron 6128, organized on a 4-socket motherboard. Theoretical peak for entire machine is 256 GFLOPS; (2) X12: 12-core, 2.0 Ghz Intel Xeon E5-2620, organized on a two-socket motherboard wt theoretical machine peak of 192 GFLOPS.
Our results are reported as a percentage of the given theoretical machine peak, using the flop count provided by LAPACK's dopla.f [4] , which for LU is roughly 2 3 N 3 . We use the first touch NUMA memory initialization technique discussed in Section 5.2 of [5] to initialize all matrices, so that page ownership is distributed amongst the parallel cores. This ensures that approaches that copy the input matrix (eg., our approach and PLASMA) do not have a huge advantage over ones that do not (ATLAS, LAPACK, MKL) based merely on input initialization. For each machine, we time problems as large as possible without thrashing virtual memory, which leads to a max size of 50,000 for X12 and 45,000 for O32.
B. History, Motivation and Related Work
Historically, parallel DGETRF providers could focus almost exclusively on asymptotic performance. These are the only problem sizes that require parallel solutions, and Performance of LU factorization on 12-core Intel Xeon E5-2620 for netlib lapack (yellow circles), ATLAS (green diamonds), FLAME (orange point-right triangles), PLASMA (red x), empirically tuned PLASMA (dark red +), and MKL (black point-up triangles) users with small-to medium-sized problems could count on continuing performance gains from increasing clock rate. Since improving clock rate has been largely replaced by increasing parallel scale, it has become much more important to provide parallel speedup for modest sized problems as well. Therefore, Figure 1 , which attempts to summarize the state-of-the-art in parallel DGETRF performance, has an X axis that has been manipulated to allow us to clearly see problems in this range (N < 2000). Note that the first nine data points have a stride of 200, while the larger sizes use a stride of 1000 between points. This leads to the illusion that there are huge increases in performance between points 2000 and 3000: this is due to the change in scale that occurs between these two points. The Y axis shows the percentage of theoretical peak achieved by various DGETRF implementations.
Netlib lapack using the ATLAS [6] - [8] parallel BLAS is the worst performer, as shown in yellow circles. Due to shortcomings in this original approach, the recursive formulation of LU was developed [9] and later implemented in ATLAS; the performance of this variant is shown as green diamonds in Figure 1 . Due to its more effective exploitation of DGEMM, this algorithm is superior at all problem sizes compared to the original LAPACK. Both of these approaches have the strong advantage that they leave almost all of the optimization (both serial and parallel) to the BLAS, freeing the much larger LAPACK library to focus on the algorithmic level. The problem with this approach is its extremely slow rise in parallel performance with problem size; as parallel scale is increased, moderately sized problems will achieve an increasingly minuscule percentage of machine peak due to various inefficiencies and serializations.
Therefore, the current wave of research has concentrated on breaking the problems up into sub-blocks (often called tiles) that can be explicitly scheduled. Examples of this general approach include both the FLAME [10] - [13] (though not for LU) and PLASMA [14] - [17] libraries. In FLAME (orange, point-right triangles), their highest performing LU uses multi-level static blocking with blocking factors roughly tuned for this version MKL; this was shown to provide roughly the same performance as recursion in [18] . However, unlike recursion, this approach's optimality is strongly dependent on both the hardware and the BLAS implementation used, which is why it is generally less recommended than the recursive implementation. Since FLAME's LU is not meaningfully different from a handtuned LAPACK, we omit it from later charts for clarity.
The performance of PLASMA is shown as red "x" in Figure 1 , and it demonstrates the fundamental problems experienced by these tiled algorithms. We see explicitly managing the parallelism at the LU level has paid off with an algorithm that scales much better for moderately sized problems than the prior approaches, but the performance hits a ceiling (in this case at just under 75% of peak), which leads to disappointing asymptotic performance despite scaling that is usually almost perfect in this range. The reason for this ceiling is straightforward: the BLAS contain optimizations such as copying input data to architecture-specific formats that must be amortized over the call to the operation. By breaking the problem into tiles of fixed sizes, these algorithms strongly constrain the problem sizes they call the BLAS with, which leads to these overheads remaining important at all problem sizes. Therefore this efficiency ceiling is in some sense a measure of the percentage of overhead when calling the BLAS with such small problem sizes. While this problem is unavoidable in the BLASbased tiled approach, there is no reason for asymptotic performance to be this poor, since it should be possible to drastically increase the tile size for extremely large matrices while maintaining sufficient parallelism. Therefore, in order to get a best-case for the PLASMA approach, we wrote an autotuner that tried all relevant combinations of inner and outer blocking factors for PLASMA, resulting in the much-improved empirically-tuned PLASMA curve shown as dark red "+". This leads to our first important observation: PLASMA's large-case (and to a lesser extent, small-case) performance could be drastically improved merely by taking problem size into account when choosing blocking factors; since this could be accomplished by instantiating some simple rules of thumb in a case statement, we are puzzled that PLASMA currently ignores problem size.
Note that on this machine, the recursive algorithm actually outperforms both PLASMA variants for extremely large problems: this is because recursion results in BLAS calls of dimensions that grow with problem size, with the largest call dominating execution time. For huge problems like this, low-order overheads are essentially free, and so recursion asymptotically approaches the speed of DGEMM, which is almost always very near machine peak.
Hardware vendors also provide optimized LAPACK and BLAS libraries, the most important of which are Intel's MKL [19] and AMD's ACML [20] . We see that MKL (black points-up triangles) and PLASMA scale roughly equally well for moderately sized problems, but MKL's performance continues to rise until it reaches a remarkable point at just over 90% of peak. Since MKL is a proprietary library tuned by the same company that designs the hardware, we cannot know all of the techniques MKL exploits to achieve this level of efficiency. Therefore, the question becomes: can we match or improve on this performance in a general library such as ATLAS? The answer, we will show, is a decided yes.
II. OUR APPROACH
There are two problems seen in Figure 1 that we would like to address: (1) We would like to enable small-tomoderate sized problems to achieve a much greater percentage of machine peak than even the explicitly parallel approaches like PLASMA deliver; and (2) in explicitly parallelizing the operation, we must avoid the asymptotic ceiling problem that is typical of the tiled/BLAS approach. We address these issues by employing several separate but related techniques.
In order to address moderate sized performance, it is critical to minimize the data movement, which is the dominant cost in this range (even though it is a low order term, and thus unimportant for asymptotic performance). Minimizing data movement at this level requires explicitly controlling (to the extent possible) how things are located in the cache, utilizing cache-optimized storage patterns, and using the owner-computes rule to ensure data loaded to a particular cache is not unnecessarily moved to another cache. We call this technique Parallel Cache Assignment (PCA), and in our previous work [5] , [21] we showed that it can yield the most scalable and highest performing known algorithms for many bus-bound operations in this size range. In order to allow extremely fine-grained parallelization, we also exploit the x86's cache coherence mechanism to provide hardwarespeed parallel synchronization and communication [5] (these overheads can be important in this range). In this earlier work, we concentrated on unblocked computations, but here we have adapted PCA for use in a multilevel staticallyblocked left-looking LU. Our block handling directly descends from the distributed memory work of ScaLAPACK [22] , [23] . Just as in ScaLAPACK, we lay out the cores in a 2-D process grid, and the data is then distributed amongst the threads (and thus amongst the caches) using a 2-D block cyclic distribution. We then explicitly manage and minimize movement in and out of caches using local copies and remote reads just like message passing is used in distributed memory parallelization to minimize communication (i.e. ScaLAPACK's off-node message becomes our out-of-cache operation); this allows us to reuse research on optimal communication patterns in this new context.
The main advantage of PCA is that when the problem is capable of being held in the collective cache of the machine, a PCA algorithm can achieve the theoretical minimum memory access allowed by the algorithm. Since bus bandwidth is the main constraint on performance in this range, this fact alone will greatly increase our performance anytime the entire problem can be held in the collective cache. Since the collective cache has tended to grow at least weakly with parallel scale, the sizes of problems that can be cachecontained is now fairly impressive. The O32 machine has roughly 40MB of usable collective cache (some of the cache is reserved for use by AMD's cache coherence mechanism), while X12 has roughly 30MB of cache. Because caches don't use LRU replacement, they tend to have less effective space even when managed by PCA, and so we can expect our algorithms to start losing some of their cache advantage around N = 1800. The idea is to use PCA to increase performance until data movement becomes a dominated loworder term, and thus provide an algorithm that maximizes performance across all problem sizes.
Even if PCA can deliver good small and moderate-sized performance, the fact that we are using fixed-sized blocks has the possibility of constraining our asymptotic performance just as happens with the tiled/BLAS approach. In ScaLAPACK this was handled by aggregating the cyclically distributed blocks into normal column-major matrices within the local memories, which allowed for extremely large DGEMM calls to be made. While this is better asymptotically than storing data as tiles, it introduces a large copy cost into the algorithm which exerts a negative effect on parallel scaling of non-asymptotic problems. Furthermore, this storage pattern makes poor use of the memory hierarchy. In order to avoid these copy overheads, we rewrote our algorithm to directly call the kernels that ATLAS uses to create its optimized DGEMM, rather than calling DGEMM. We also extended the ATLAS framework so that a user can provide ATLAS with a list of specific problem sizes to be auto-tuned during installation. This allows ATLAS to serve as an auto-tuner for blocked kernel operations, in addition to doing its normal tuning of the BLAS.
To clarify, when ATLAS tunes a complex operation like DGEMM, ATLAS breaks the problem into simpler kernels that can be effectively optimized on a particular machine. These GEMM kernels are simplified matrix multiplications where the problem dimensions have been reduced so that the operands are known to fit into some level of cache. Thus, the operands to these kernels can be thought of as blocks or tiles. The problem then is that these blocks are not stored in normal array or block formats, but rather the formatting is arranged in a fashion that is tuned for both the operation and architecure [24] . ATLAS originally used one format for all machines, but has just been rewritten to allow the block storage to vary depending on architectural and kernel details (even on the same machine, ATLAS may use a variety of storage patterns; for instance in order to handle different block shapes). Since the storage pattern is not fixed, it would seem impossible to write an LU factorization that uses it, but this can be handled by using a data copy. Our extended ATLAS framework not only generates the DGEMM kernels themselves, it also automatically generates routines that copy normal row-or column-major storage to and from its internal block storage formats. Data copies are used to manage the cache anyway, and so these can be incorporated into the design in order to allow an application to automatically use variable storage patterns. Since the copy of blocks is now explicitly performed by the application, it can be managed and minimized along with all other communication costs, rather than occurring implicitly with each BLAS call.
A. Our Contribution
These two broad ideas: using blocked PCA for smallcase performance, and exploiting overhead minimizing highperformance kernels for block operations are the essential ideas needed to improve performance across the entire range of problem sizes. In Section III we will provide an outline of our actual algorithm, which is essentially a blocked left-looking owner-computes LU factorization with infinite lookahead capabilities and multilevel blocking to improve panel factorization performance. However, these implementation details are less important than the two main ideas we described in this overview. For instance, the observed data access pattern of MKL is that of a rightlooking algorithm (we employ left-looking), and yet MKL achieves almost as good performance asymptotically as our approach. Therefore, our main contribution is in highlighting these two key concepts, and in extending the ATLAS autotuning framework to auto-tune these block computational kernels for our own and other researchers' use. Figure 2 shows the performance achieved by our algorithm (labeled as PCA-bk for blocked PCA) against the state of the art on both machines. On the Intel (Figure 2a ), our implementation (blue squares) dominates all libraries for moderately sized problems, achieving over 50% of peak for N = 1600 spread over 12 cores, and essentially ties MKL asymptotically (MKL achieves 90.06% of peak, while we get 90.7%). Figure 2b shows the results on the 32-core AMD system, where our algorithm is clearly superior for the entire curve. Note that we plotted Intel's MKL on the AMD machine, but not the reverse; this is because MKL provides the best asymptotic performance (excluding our own) on AMD, but ACML does not provide good performance on Intel.
1) Outline of Remaining Contributions:
The rest of this paper is organized as follows: Section II-B presents some implementation details that could effect the sustainability of these results, while §II-C outlines a surprising and straightforward method of improving LU performance that we previously (and erroneously) considered unimportant. Section III then explains the details of our algorithm, while §IV gives some prioritization advice on the optimizations we implemented. Finally §V will describe some future work, with §VI providing summary and conclusions.
B. Drawbacks in Our Current Implementation
Since our LU distributes blocks across a 2-D process grid, for every problem size and architecture we must choose values for the number of rows in the process grid (r) and the number of columns (c), under the constraint r × c ≤ p, where p is the number of cores. We must also choose a blocking factor (N b ); typically small problems must use a small block factor in order to increase parallelism at the expense of serial performance, while large problems will allow us to expand the block factors in order to saturate serial kernel performance. N b is further constrained by essentially two factors: (1) the size of the cache, and (2) the architecture-dependent kernel details discovered during the ATLAS kernel tuning step. This still leaves us with a large degree of flexibility in choosing the N b For this research, we empirically tuned N b , r, and c using a brute-force search that is probably too expensive to be used routinely. In the future we will need to investigate to what degree we can replace this brute-force search with a model, heuristic, or combination of these with a smarter search. If this is not done well, the smooth climb in performance we achieved here can become more stair-step like, as we transition from a smaller to larger blocking factors at inappropriate times or choose less efficient process grids. There is little doubt that a more sustainable empirical search can be constructed that keeps the overall picture roughly the same, but future research is needed to demonstrate how close we can get to these curves where we have essentially searched the entire optimization space.
In order to improve small-case performance, we do all computation via the owner-computes rule. Since block cyclic distributes the data roughly evenly across cores, if one or more cores experiences sustained unrelated load, the entire algorithm can be slowed down (see §III for details; our algorithm is not statically scheduled, but due to the ownercomputes rule a heavily loaded core's tasks will eventually get into the critical path of the algorithm). Therefore, if the library is aimed primarily at usage where multiple jobs share cores, it may make sense to enable work stealing, which will fix this issue at the cost of reducing cache reuse.
C. The Surprising Importance of Parallelizing Row-swap
In the course of this research we measured the impact of various optimizations. The most surprising result of this profiling was how important parallelizing the LAPACK routine DLASWP turned out to be. DLASWP is an LAPACK routine used mainly to swap the rows dictated by LU's pivoting strategy. As such, it is a very simple function, and its parallelization requires almost no effort (it took us less than an hour, including debugging and timing).
Since this is an O(N 2 ) cost that is completely memory hierarchy bound, we did not anticipate noticeable speedup in parallelizing this operation. However, in ATLAS's recursive LU, we got as much as 16% speedup on X12, and almost 18% speedup for O32, for the entire LU factorization (i.e. not an 18% speedup in swap, which would not have surprised us as much). This maximum speedup happens in the middle of the data range (for small problems, the • As last step, apply pivots (from panels to the right) to non-diagonal part of L Figure 4 . Basic steps for left-looking LU for square matrices row isn't long enough for parallel operation to make much difference, and for very large problems this O(N 2 ) cost is dominated). However, it appears that as parallel scale is increased, and all other operations are at least partially parallelized by the BLAS, serial DLASWP (DLASWP does not call the BLAS, and so cannot be parallelized at that level) becomes a bottleneck due to some combination of memory hierarchy effects and Amdahl's law. On X12 we saw only a 5% asymptotic speedup at N = 45, 000, but on the O32 it was still providing 13% speedup over performing the swaps serially. Therefore, we recommend that even those libraries wishing to allow the BLAS to handle the bulk of the optimization investigate parallelizing DLASWP, due to its simplicity and surprising payoff.
III. OUR APPROACH IN DETAIL

A. Understanding Left-looking LU
Our algorithm is based on the left looking variant of LU 1 . The algorithm breaks the matrix up into N b -wide column panels, as shown in Figure 3 . These panels are factorized one at a time from left to right, as outlined in Figure 4 ; the algorithm is "left-looking" because to factor column panel i, we read ("look") at only the panels to the "left" (0 ≤ j < i). The main advantage of left-looking comes when we can contain the majority of the active panel in the (collective) cache; in this case, the repeated writes to the column panel cause little or no bus traffic, and will be smoothly ejected as needed during later steps (unless the entire matrix fits in the cache, in which case they will only cause bus traffic when the final matrix is written out).
The panel factorization (step 2 of Figure 4 ) is just an LU factorization specialized for tall-skinny panel shapes, rather than square matrices. It can be implemented in a variety of ways, including recursion, (multi-level) static blocking, and using an unblocked algorithm. For now, it is enough to know that it will produce an LU factorization for a tall and skinny shaped input matrix. In Figure 3 , we assume we have already factored the first two column panels, and panel2 (third panel, dark grey) has become the active panel, which means we will apply updates arising from the first two panels in turn, before factorizing that portion of panel2 that hasn't become U (in this case, the upper 2 blocks become part of U ). After the panel factorization is called on the remaining blocks of panel2, the third block from the top will become this panel's diagonal block, with the blocks beneath holding the final (but not fully pivoted) L. Therefore, the first three panels will be completely factorized, and panel3 will become the active panel. This process is outlined more formally in Figure 4 .
B. Our Overall Parallel LU Factorization
In order to describe the parallel algorithm, we will need to define some terminology. Anytime we use the word panel with no modifier, assume it is one of the column panels shown in Figure 3 . We distribute the input matrix on a r × c process grid (here we use process, thread and core interchangeably). Figure 5a shows a process grid for r = 2 and c = 3. Only r processes will work on a panel at a time and we will refer to these r processes as a pcol. So, for the 2 × 3 grid, thread 0 and 3 will be referred as pcol0, thread 1 and 4 as pcol1 and so on. Just as a matrix column or columnpanel is owned by a pcol, a given matrix row or row-panel will also be owned by a prow. A thread will now have two different views of the matrix: a global view that includes the entire input matrix, and a local view, that contains only the blocks that belong to this thread due to the block cyclic distribution. In the serial algorithm, the leftmost unfactored column panel was called the active panel, but locally, all c pcols will have a leftmost panel, which we will call the llpan (locally-leftmost panel). However, only one pcol's llpan will be the one that is currently being factorized, and we will refer to this panel as gcpan (globally-critical panel). Figure 5b shows the distribution of a matrix for 2 × 3 process grid. As you can see, all the computations on the gcpan (panel 2) would be applied by pcol2; since panels 0 and 1 have been completely factored, pcol0 and pcol1 could be applying these updates to their llpan (panel 3 and panel 4, respectively).
The scheduling of parallel work is simple in abstract, but can be complex to understand once the algorithm is heavily optimized. Therefore, Figure 5c shows our basic algorithm before optimization, which is much more straightforward. We will now discuss how our optimizations improve this implementation, before summing up our full parallel algorithm.
From Figure 5c we see that each pcol is always working on one of their local column panels (when work is available), and so computation is normally occurring in all c column panels simultaneously. This effectively translates to the parallel algorithm providing c − 1 lookahead for free. We will discuss the steps of Figure 5c in detail and discuss the scheduling optimizations we applied to improve them.
1) Copy-In:
This step does not appear in the serial algorithm; its purpose is to bring blocks of the input matrix into the cache and store them in one of ATLAS's internal storage formats. Figure 5c shows the simplest approach where each thread starts the algorithm by copying all its local blocks into this storage. The problem is that this will put an initial massive load on the bus, resulting in strongly reduced parallel performance during this stage. Therefore, at the start of our algorithm, each thread will only copy their share of the first panel they will work on. So in Figure 5b , pcol0, pcol1 and pcol2 will only copy their share of panel 0, 1 and 2 respectively. Each thread will copy their share of subsequent panels only when they first need to access them (either because the panel has become the llpan or is examined due to lookahead). Performing the copy only when the data is needed for computation has two advantages: (1) it will produce less bus contention by spreading bus access throughout lifetime of the algorithm, and (2) since the computations on the copied data will be done right after the copy, it is much more likely we will operate on in-cache data.
2) Updates: When performing the actions of steps 2a(ii) and 2a(iii) of Figure 5c , it is necessary to await the factorization of at least the first gcpan. When a gcpan is factorized, every thread will get a signal indicating that an additional update opportunity is available.
Step 2a(ii) is performed by only one thread within the pcol owning the llpan, and the other threads need the computed U block for the next step, which means the remaining r − 1 threads are idle until step 2a(ii) is done (we will discuss removing this idle time later). Note that when U is formed in step 2a(ii), it is the final U of the algorithm. Therefore, in order to spread out the bus utilization of step 3, this block will be copied back to original storage as soon as the thread owning it completes its portion of step 2. Note that one pcol's llpan is actually the gcpan, which is why it is critical to finish the panel (that other pcol's are waiting on) before performing the copy-back of U .
3) Panel Factorization: Once all the updates are applied on a panel, it is by definition the gcpan, and threads working on this panel will start working on the panel factorization, which can be done in several ways. In section III-C1, we will discuss in detail on how we do the panel factorization, but here we mention a small but important (particularly for small and medium sized problems) optimization that effects the first gcpan only. The first panel is special in that all other pcols are idle, and this panel proceeds directly to the panel factorization. Since everyone must wait on this panel to be factorized, we prioritize its operations by having all other threads delay their normal copy-in step until the first panel has been copied to the local storage of pcol0. At that point, pcol0 signals that other threads may begin the copy-in of their llpan while pcol0 performs the panel factorization on in-cache data (at least for moderately sized problems). This prioritization allows the critical path to utilize the entire bus bandwidth during the copy-in stage, by artificially idling the rest of the pcols, which will then do their copy-in using shared bandwidth while the critical-path panel factorization is performed by pcol0.
After each panel factorization is performed (i.e., this discussion applies to all panels, not just the first), the pcol will copy back the factorized panel (the U blocks have already been copied back, as previously discussed). The diagonal block, like U , is in its final form, and so when copyback is complete it is done. The non-diagonal portion of L, however, will be subject to later pivoting, and is therefore not in its final form even though it has been copied back to the input matrix.
4) Copy-back:
In our optimized algorithm, this has been merged into prior steps, and no longer exists as a distinct step.
5) Pivoting to the left: As described in Figure 5c , the pivots coming from later panels (to the right of the panel of L under consideration) are not applied until all panels have been factored. This is normally a hard requirement of left-looking algorithms due to data dependencies. However, since we actually do our updates using local copies of L, we are free to pivot those columns that have been copied back to the original storage at any time. Note that as the algorithm factorizes the last c panels, each time a panel is factored one pcol becomes idle. As an easy optimization, these idle pcols instead begin applying the later (to the right) pivots to the portions of L that have been written back to the original storage. For simplicity, we chose to just divide the swap work amongst all cores in this initial implementation, which means that even the last pcol to complete the algorithm will still need to do some L swaps (though the lion's share will have been done while the factorization was still proceeding). Since swap is very demanding on the bus, this will tend to spread out the bus access better than doing all swaps at the same time. In the future, we should introduce completely dynamically scheduled swapping in this step so that it is possible that all swaps of L complete at the same time as the overall factorization.
C. More Complicated Optimizations for Improving Scaling
Up until now, we have discussed only easily implemented improvements over the straightforward algorithm given in Figure 5c . However, further optimization is needed to make this algorithm competitive with the state of the art. Therefore, in this section we explore more complex optimizations that collectively improved our performance by roughly 40% for moderate sized problems, and almost 9% asymptotically on the O32 (the impact was less on the X12 due primarily to its lesser scale). Note that these improvements are sometimes complementary, so applying all of them gives a greater benefit than applying them individually.
1) Panel Factorization:
For problems of reasonable size, the panel factorization is dominated by the higher-order computations coming from DGEMM and DTRSM. Therefore, for simplicity our first implementation performed the panel factorization serially. More specifically, all cores in the pcol would copy their local blocks back to the original matrix, where we could call ATLAS's DGETRF serially, after which all cores would copy their local blocks back to local storage. When we profiled our code using the serial factorization, this appeared adequate: the panel factorization time was negligible, and the main idle time it induced that we could see came from having r − 1 cores in the pcol awaiting the results of the panel factorization (the other pcols could typically remain busy by performing lookahead, as discussed later).
However, we knew that this panel factorization was always in the critical path, and so we later parallelized it even though it seemed the impact might be small. What we found was that even when the panel factorization time was negligible, we could see large overall performance impacts. The reasons for this disproportionate impact are a mixture of critical path optimizations reducing idle time, and improved cache effects. To understand this better, imagine in the middle of the algorithm pcol0 is working on the gcpan; when the factorization is serial, the serial core tends to flush its cache doing the panel factorization; meantime other pcols have run out of updates for their llpan and begin applying earlier updates to panels to the right (lookahead) to avoid idling. If they go through too many such panels (or one very large panel), they will also flush their cache so that llpan will need to reloaded, and one of the reloads will be in the critical path, and perhaps fighting for bandwidth with other pcols' reloads.
One lesson is that when scheduling is not static, critical path optimizations can provide performance benefits that are not correlated with their raw compute times or even the easily measured idle times of other cores. Commenting out the operation in question can give you a rough idea of the impact of parallelization, but even this will not capture all cache effects. Therefore, it is highly recommended to at least prototype and time potential optimizations that effect the critical path and therefore the implied scheduling.
For simplicity, our first parallel panel factorization was an unblocked implementation. We later extended this to blocking at two levels. For this work we fixed the inner blocking at 4 (this is the smallest dimension DGEMM that our framework can meaningfully tune), with an outer blocking of 12 (this was dictated by the architectural preferences of the two machines). In general, the performance improvement in going from unblocked to blocked depends on many factors (whether data is in or out of collective cache, the speed of the small-sized GEMM kernels, etc.), but its advantage should typically grow strongly with the full factorization's blocking factor. If the panel overflows the pcol's collective cache, its advantage should also grow with parallel scale (due to reduced bus contention).
2) Infinite lookahead: Panel updates always must wait on the factorization of gcpan, and so non-gcpan pcols will tend to finish updating their llpan (using already factored panels of L), and become idle (particularly if the panel factorization is performed serially, as in our initial implementation). To reduce this idle time, we introduced to a form of dynamic scheduling that essentially provides infinite lookahead. The idea is threads will always prioritize their llpan, but when they run out of updates for it, they will begin to apply prior updates (already applied to their llpan) to panels further to the right. The longer the gcpan computations take, the more lookahead will be performed, and the more our "leftlooking" algorithm will begin to resemble a "right-looking" variant. Lookahead is illustrated in Figure 6a , assuming panels 0 and 1 have already been factored, and panel2 (owned by pcol2) is the gcpan. If at this point pcol0 has already applied the first two panel updates to its llpan (panel3), it will move to it's next panel (panel 6) to apply panel0's (and possibly panel1's) updates. Having made the decision to update panel6, pcol0 will apply panel0's updates, and then it will check if the factorization of panel2 is complete. If so, it leaves panel6 updated only by the first panel, and returns to updating its llpan, which is always the panel most in the critical path of the algorithm. Therefore, the transition back to working on panel3 shown in Figure 6b can occur after any number of updates are applied to a panel on the right, which requires the algorithm to keep track of the number of updates applied to each panel independently. In scheduling, we never move to a panel to the right until all available updates are done to all panels to the left of that panel (therefore, if we returned to panel3 after applying only one panel to panel6, and we later became idle we would next apply panel1's updates to panel6 before applying panel0's updates to a notional panel9 (not shown in figure) ).
3) Increased Parallelism of DLASWP and DTRSM: Note that step 2ii of Figure 5c is wholly serial, due to the ownercomputes rule. Only one thread within the pcol owns the block that is changed into a block of U using the previous panel's diagonal block, and so all r − 1 other threads within the pcol are idle while the thread owning this block computes the new portion of U using DLASWP and DTRSM. The swap is an O(N 2 b ) memory-bound operation, while the DTRSM is O(N 3 b ) compute-bound operation. Therefore, as long as N b is small, these costs are trivial, but as N b rises, this serial operation can become problematic. Recall that we use the owner-computes rule primarily to avoid unnecessary cache pollution, but that is not as important in this case. To understand why, recall that once we finish this step, we will proceed to step 2iii, which will require every core on the pcol to load all of the produced block of U into its cache anyway, and so the only cache pollution will come from having all cores in the pcol read the diagonal block, rather than just one core.
Therefore, we violate the owner-computes rule and divide the columns of this prospective U block amongst all r cores in the pcol, and each one performs the swaps and updates on its reduced-length rows in parallel. Because we needed a parallel sync after the serial operation was performed anyway, this parallelization requires no additional parallel overhead. However, for small block factors it is nonetheless still the case that overall performance can be slightly lower for this parallel operation than the serial; the reason is that DTRSM is called with fewer right hand sides, which means the operation gets less cache reuse, and may wind up calling a cleanup case within the kernel itself. If the reduced serial performance is not made up by the increased scale, this parallelization will result in a slight slowdown. In our implementation, we always parallelize this operation regardless of N b , judging that the performance loss on very small problems is too minor to justify empirically tuning the crossover between serial and parallel operations (this crossover point is almost wholly determined by architecture and kernel implementation details that could not reasonably be captured in any a priori model).
IV. PRIORITIZING THE COMPLEX OPTIMIZATIONS
Of the more complex optimizations surveyed in §III-C, the most difficult to implement is probably infinite lookahead. This technique is conceptually simple, but in practice it tends to introduce race conditions that are difficult to debug. This optimization is mostly important in the 400 ≤ N ≤ 2000 range, where lack of lookahead can drop achieved performance by as much as 17% for X12 and 9% for O32. Asymptotically, its tendency to occasionally flush the cache makes it very slightly more hindrance than help.
Parallel panel factorization is crucial for asymptotic performance at scale: on O32, the largest problem loses almost 3% performance when the panel factorization is done in serial, though it only loses around 1% if the parallel factorization is unblocked. The blocked panel factorization is not meaningfully more complex to parallelize than the unblocked, but since it is conceptually more complicated, we found it easier to first implement the unblocked, and with the parallelization debugged, write the blocked version. Since not having a blocked panel factorization could conceivably prevent you from using a larger N b , it is probably worth implementing the blocked parallel panel factorization if you are concerned with asymptotic performance.
Parallelizing the DLASWP and DTRSM by violating the owner-computes rule has its greatest impact for mid-range (say around N = 8000, where not implementing it can cost you as much as 5% on both machines); around this size, its low-order term is not yet dominated by DGEMM, but the N b has gotten large enough for pcol idle-time to be a problem. Asymptotically it gives a performance boost in the 1-1.5% range. This optimization is worth performing due to its straightforward implementation.
V. FUTURE WORK
See Section II-B for discussion of the need for developing some combination of model, heuristic and empirical tuning for choosing grid and blocking parameters, and of possibly adding work stealing. In order to apply our approach to many more applications, we need to investigate encapsulating our synchronization and communication in library calls, so that architecture-specific code is not embedded in the application. We also need to measure the impact of using standard mutexes on small-case performance, since the cache-based communication we used here will not work on systems with weakly-ordered cache coherence (ARM and PowerPC both have weakly ordered cache systems). Finally, we must document our extensions to the ATLAS framework for others' use, and generalize it so any number of applications can ask for particular kernels without causing namespace collisions. When these investigations are complete, we should apply these techniques to all data precisions of all the factorizations (LU, Cholesky, QR variants).
Longer term, there is limited room for improvement of asymptotic performance, and most appreciable performance gains will likely come from increasing our serial DGEMM efficiency. On the other hand, moderate-sized problems still have room for improvement; most of our current ideas in this area depend on increasingly platform-dependent bus management strategies, which are probably not worth implementing unless we have a compelling use case (i.e. a critical real-life application that requires solving a series of dependent, fixed-sized problems).
VI. SUMMARY AND CONCLUSION
We have introduced a new approach to LU factorization, and shown that it handles both small and asymptotic problems better than the tiled/BLAS approaches used in PLASMA and FLAME. Further, we have shown that it produces the best known performance on large-scale representative Intel and AMD shared memory architectures across all problem sizes, including vendor-supplied libraries such as MKL and ACML. This approach is widely applicable, and the tuning framework we developed in ATLAS can be used to achieve extremely high parallel performance for any DGEMM-based operation. More broadly, if an empirical kernel framework like that of ATLAS is constructed, this parallelization approach should be effective for any blockable HPC application. Along with the critical importance of using the approach advocated in this paper for parallelization, we draw three subsidiary conclusions: (1) In the classic parallelization approach, parallelizing DLASWP is surprising important. This routine's parallelization is extremely straightforward, and so it should be done by all libraries, (2) For tiled libraries like PLASMA it is necessary to vary the blocking factors with problem size, and (3) Profiling critical path operations can massively under-predict the performance impact of optimizing them, due to scheduling differences and cache effects, and we therefore recommend that all possible critical-path optimization should be prototyped and timed, even when their potential impact appears small.
