A Case for Malleable Thread-Level Linear Algebra Libraries: The LU
  Factorization with Partial Pivoting by Catalán, Sandra et al.
A Case for Malleable Thread-Level Linear Algebra
Libraries:
The LU Factorization with Partial Pivoting
Sandra Catala´na, Jose´ R. Herrerob, Enrique S. Quintana-Ort´ıa, Rafael
Rodr´ıguez-Sa´ncheza, Robert van de Geijnc
aDepto. Ingenier´ıa y Ciencia de Computadores, Universidad Jaume I, Castello´n, Spain.
bDept. d’Arquitectura de Computadors, Universitat Polite`cnica de Catalunya, Spain.
cDepartment of Computer Science, Institute for Computational Engineering and
Sciences, The University of Texas at Austin, USA.
Abstract
We propose two novel techniques for overcoming load-imbalance encoun-
tered when implementing so-called look-ahead mechanisms in relevant dense
matrix factorizations for the solution of linear systems. Both techniques
target the scenario where two thread teams are created/activated during
the factorization, with each team in charge of performing an independent
task/branch of execution. The first technique promotes worker sharing (WS)
between the two tasks, allowing the threads of the task that completes first
to be reallocated for use by the costlier task. The second technique allows
a fast task to alert the slower task of completion, enforcing the early termi-
nation (ET) of the second task, and a smooth transition of the factorization
procedure into the next iteration.
The two mechanisms are instantiated via a new malleable thread-level
implementation of the Basic Linear Algebra Subprograms (BLAS), and their
benefits are illustrated via an implementation of the LU factorization with
partial pivoting enhanced with look-ahead. Concretely, our experimental
results on a six core Intel-Xeon processor show the benefits of combining
WS+ET, reporting competitive performance in comparison with a task-
parallel runtime-based solution.
Email addresses: catalans@uji.es (Sandra Catala´n), josepr@ac.upc.edu (Jose´ R.
Herrero), quintana@uji.es (Enrique S. Quintana-Ort´ı), rarodrig@uji.es (Rafael
Rodr´ıguez-Sa´nchez), rvdg@cs.utexas.edu (Robert van de Geijn)
Preprint submitted to arXiv.org October 15, 2018
ar
X
iv
:1
61
1.
06
36
5v
1 
 [c
s.D
C]
  1
9 N
ov
 20
16
Keywords:
1. Introduction
In the 1970s and 80s, the scientific community recognized the value of
defining standard interfaces for dense linear algebra (DLA) operations with
the introduction of the Basic Linear Algebra Subprograms (BLAS) [1, 2, 3].
Ever since, the key to performance portability in this domain has been the de-
velopment of highly-optimized, architecture-specific implementations of the
BLAS, either by hardware vendors (e.g., Intel MKL [4], AMD ACML [5],
IBM ESSL [6], and NVIDIA CUBLAS [7]) or independent developers (e.g.,
GotoBLAS [8, 9], OpenBLAS [10], ATLAS [11], and BLIS [12]).
Multi-threaded instances of the BLAS for current multi-core processor
architectures take advantage of the simple data dependencies featured by
these operations to exploit loop/data-parallelism at the block level (here-
after referred to as block-data parallelism, or BDP). For more complex DLA
operations, like those supported by LAPACK [13] and libflame [14], exploit-
ing task-parallelism with dependencies1 (TP) is especially efficient when per-
formed by a runtime that semi-automatically decomposes the computation
into tasks and orchestrates their dependency-aware scheduling [15, 16, 17, 18].
For the BLAS kernels though, exploiting BDP is still the preferred choice,
because it allows tighter control on the data movements across the memory
hierarchy and avoids the overhead of a runtime that is unnecessary due to
the (mostly) nonexistent data dependencies in the BLAS kernels.
Exploiting both BDP and TP, in a sort of nested parallelism can yield
more efficient solutions as the number of cores in processor architectures
continues to grow. For example, consider an application composed of two
independent tasks, TA and TB, both of which are inherently parallel and
preceded/followed by synchronization points. In this scenario, exploiting TP
only is inefficient, because it can keep at most 2 threads busy. To address
this, we could take advantage of the BDP inside TA and TB via TP but, given
their inherent parallelism, this is likely to incur certain overhead compared
1We view TP as the type of concurrency present in an operation that can be decomposed
into a collection of suboperations (tasks) connected by a rich set of data dependencies.
Compared with this, BDP is present when the operation basically consists of a number of
independent (inherently parallel) suboperations, each acting on a disjoint block of data.
2
with a direct exploitation of BDP. Finally, extracting BDP only is surely
possible, but it may be less scalable than a nested TP+BDP solution that
splits the threads into two teams, and puts them to work on TA and TB
concurrently.
Let us consider now that the previous scenario occurs during the exe-
cution of a complex DLA operation where both tasks, TA and TB, can be
computed via simple calls to a multi-threaded instance of BLAS. Although
the solution seems obvious, exploiting nested TP+BDP parallelism here can
still be suboptimal. In particular, all multi-threaded instances of BLAS offer
a rigid interface to control the threading execution of a routine, which only
allows one to set the number of threads that will participate before the routine
is invoked. Importantly, this number cannot be changed during its execution.
Thus, in case TA is completed earlier, the team of threads in charge of its
execution will remain idle waiting for the completion of TB, producing a
suboptimal execution from a performance perspective.
The scenario that we have described is far from being rare in DLA. To
illustrate this, we will employ the LU factorization with partial pivoting for
the solution of linear systems [19]. High-performance algorithms for this
decomposition consist of a loop-body that processes the matrix from its top-
left corner to the bottom-right one, at each iteration computing a panel
factorization and updating a trailing submatrix via calls to BLAS. We will
review this factorization as a prototypical example to make the following
contributions in our paper:
• Malleable DLA libraries: We introduce a malleable thread-level imple-
mentation of BLIS [12] that allows the number of threads that par-
ticipate in the execution of a BLAS kernel to dynamically change at
execution time.
• Worker Sharing (WS): In case the panel factorization is less expensive
than the update of the trailing submatrix, we leverage the malleable
instance of the BLAS to improve workload balancing and performance,
by allowing the thread team in charge of the panel factorization to be
reallocated to the execution of the trailing update.
• Early Termination (ET): To tackle the opposite case, where panel fac-
torization is more expensive than the update of the trailing submatrix,
we design an ET mechanism that allows the thread team in charge of
the trailing update to communicate the alternative team of this event.
3
This alert forces an ET of the panel factorization, and the advance of
the factorization into the next iteration.
• We perform a comprehensive experimental evaluation on a 6-core Intel
Xeon E5-2603 v3 processor, using execution traces to illustrate actual
benefits of our approach, and comparing its performance to those ob-
tained with a runtime-based solution using OmpSs [15].
The key to our approach is that we depart from conventional instances of
BLAS to instead view the cores/threads as a pool of computational resources
that, upon completing the execution of a BLAS/LAPACK routine, can be
tapped to participate in the execution of another BLAS/LAPACK routine
that is already in progress. This WS supports a dynamic choice of the algo-
rithmic block size as the operation progresses. Furthermore, the same idea
carries over to all other major matrix decompositions for the solution of linear
systems, such as the QR, Cholesky and LDLT factorizations [19].
2. The BLIS Implementation of Basic Linear Algebra Kernels
BLIS is a framework that allows developers to rapidly deploy new high-
performance implementations of BLAS and BLAS-like operations on current
and future architectures [12]. A key property of the BLIS open source ef-
fort is that it exposes the internal implementation of the BLAS kernels at a
finer-grain level than OpenBLAS or commercial libraries while offering per-
formance that is competitive with GotoBLAS, OpenBLAS, Intel MKL, and
ATLAS [20, 21]. We start by reviewing the design principles that underlie
BLIS, using the implementation of gemm as a particular case study.
Consider the matrices A ∈ Rm×k, B ∈ Rk×n, and C ∈ Rm×n. BLIS
mimics GotoBLAS to implement the gemm kernel2
C += A ·B (1)
as three nested loops around a macro-kernel plus two packing routines (see
Loops 1–3 in Figure 1). The macro-kernel is then implemented in terms of
2Actually, the kernel in the BLAS interface/BLIS implementation for gemm com-
putes C = αC + βop(A) · op(B), where α, β are scalars, op(·) performs an optional
transposition/Hermitian-conjugation, and op(A) is m × k, op(B) is k × n, C is m × n.
For simplicity, in the description we address the case where α = β = 1 and the operator
op(·) does not perform any transformation on the input matrix.
4
Loop 1 for jc = 0, . . . , n− 1 in steps of nc
Loop 2 for pc = 0, . . . , k − 1 in steps of kc
B(pc : pc + kc − 1, jc : jc + nc − 1) → Bc // Pack into Bc
Loop 3 for ic = 0, . . . ,m− 1 in steps of mc
A(ic : ic +mc − 1, pc : pc + kc − 1) → Ac // Pack into Ac
Loop 4 for jr = 0, . . . , nc − 1 in steps of nr // Macro-kernel
Loop 5 for ir = 0, . . . ,mc − 1 in steps of mr
Cc(ir : ir +mr − 1, jr : jr + nr − 1) // Micro-kernel
+= Ac(ir : ir +mr − 1, 0 : kc − 1)
· Bc(0 : kc − 1, jr : jr + nr − 1)
endfor
endfor
endfor
endfor
endfor
Figure 1: High performance implementation of gemm in BLIS. In the code, Cc ≡ C(ic :
ic +mc− 1, jc : jc +nc− 1) is just a notation artifact, introduced to ease the presentation
of the algorithm. In contrast, Ac, Bc correspond to actual buffers that are involved in data
copies.
two additional loops around a micro-kernel (Loops 4 and 5 in that figure).
The loop ordering embedded in BLIS, together with the packing routines
and an appropriate choice of the BLIS cache configuration parameters (nc,
kc, mc, nr and mr), orchestrate a regular pattern of data transfers across the
levels of the memory hierarchy, and amortize the cost of these transfers with
enough computation from within the micro-kernel [12] to attain near-peak
performance. In most architectures, mr, nr are in the range 4–16; mc, kc are
in the order of a few hundreds; and nc can be up to a few thousands [12, 20].
The parallelization of BLIS’s gemm for multi-threaded architectures has
been analyzed for conventional symmetric multicore processors [20], modern
many-threaded architectures [21], and asymmetric multicore processors [22].
In all these cases, the parallel implementation exploits the BDP exposed by
the nested five-loop organization of gemm, at one or more levels (i.e., loops),
using OpenMP or POSIX threads.
A convenient option in most single-socket systems is to parallelize either
Loop 3 (indexed by ic), Loop 4 (indexed by jr), or a combination of both [20,
21, 22]. For example, when Loop 3 is parallelized, each thread packs a
different macro-panel Ac into the L2 cache and executes a different instance of
the macro-kernel. In contrast, when Loop 4 is parallelized, different threads
will operate on independent instances of the micro-kernel, but access the
same macro-panel Ac in the L2 cache.
Consider, for example, a version of BLIS gemm that extracts BDP from
5
cm
nr
k c
cm
Threads
nrnc
nc
k c
...
...
+= .cA cBc cc cccC(i  :i  +m  −1,j  :j  +n  −1)
Figure 2: Distribution of the workload among t = 3 threads when Loop 4 of BLIS gemm
is parallelized. Different colors in the output C distinguish the panels of this matrix that
are computed by each thread as the product of Ac and corresponding panels of the input
Bc.
Loop 4 only, to be executed on a multicore architecture with t (physical)
cores and one thread mapped per core. The iteration space of Loop 4 is
then statically distributed among the t threads in a round-robin fashion,
equivalent to the effect attained by adding an OpenMP directive #pragma
omp parallel for, with a static schedule, around that loop; see Figure 2.
To improve performance, the packing is also performed in parallel so that,
at each iteration of Loop 3, all t threads collaborate to copy and re-organize
the entries of A(ic : ic + mc − 1, pc : pc + kc − 1) into the buffer Ac.
3. Algorithms for the LU Factorization on Multi-threaded Archi-
tectures
Given a matrix A ∈ Rm×n, its LU factorization produces lower and upper
triangular factors, L ∈ Rm×n and U ∈ Rn×n respectively, such that PA =
LU , where P ∈ Rm×m defines a permutation that is introduced for numerical
stability [19]. In this section we first review the conventional unblocked and
blocked algorithms for the LU factorization, and then describe how BDP is
exploited from within them. Next, we introduce a re-organized version of the
algorithm that integrates look-ahead in order to enhance performance in a
nested TP+BDP execution.
All experiments in the remainder of the paper were obtained using ieee
double-precision arithmetic on an Intel Xeon E5-2603 v3 processor (6 cores
at a nominal frequency 1.6 GHz). The implementations were linked with
BLIS version 0.1.8 or a tailored version of this library especially developed
for this work. Unless otherwise stated, BDP parallelism is extracted only
from Loop 4 of the BLIS kernels.
6
Algorithm: [A] := LU unb(A)
A→
(
ATL ATR
ABL ABR
)
where ATL is 0× 0
while n(ATL) < n(A) do
(
ATL ATR
ABL ABR
)
→
 A00 a01 A02aT10 α11 aT12
A20 a21 A22

where α11 is a scalar
rl1. a21 := a21/α11
rl2. A22 := A22 − a21aT12(
ATL ATR
ABL ABR
)
←
 A00 a01 A02aT10 α11 aT12
A20 a21 A22

endwhile
Algorithm: [A] := LU blk(A)
A→
(
ATL ATR
ABL ABR
)
where ATL is 0× 0
while n(ATL) < n(A) do
Determine block size b(
ATL ATR
ABL ABR
)
→
 A00 A01 A02A10 A11 A12
A20 A21 A22

where A11 is b× b
RL1.
[
A11
A21
]
:= LU unb
([
A11
A21
])
RL2. A12 := trilu(A11)−1A12
RL3. A22 := A22 −A21A12(
ATL ATR
ABL ABR
)
←
 A00 A01 A02A10 A11 A12
A20 A21 A22

endwhile
Figure 3: Unblocked and blocked RL algorithms for the LU factorization (left and right,
respectively). In the notation, n(·) returns the number of columns of its argument, and
trilu(·) returns the strictly lower triangular part of its matrix argument, setting the
diagonal entries of the result to ones.
3.1. Basic algorithms and BDP
There exist a number of algorithmic variants of the LU factorization that
can accommodate partial pivoting [19]. Among these, Figure 3 (left) shows
an unblocked algorithm for the so-called right-looking (RL) variant, expressed
using the FLAME notation [23]. For simplicity, we do not include pivoting in
the following description of the algorithms, though all our actual implemen-
tations, (and in particular those employed in our experimental evaluation,)
integrate standard partial pivoting. The cost of computing the LU factoriza-
tion of an m × n matrix, via any of the algorithms presented in this paper,
is mn2−n3/3 floating-point arithmetic operations (flops). Hereafter, we will
consider square matrices of order n for which, the cost boils down to 2n3/3
flops. For the RL variants, the major part of these operations are concen-
trated in the initial iterations of the algorithm(s). For example, the first 25%
iterations account for almost 58% of the flops; the first half for 87.5%; and
the first 75% for more than 98%. Thus, the key to high performance mostly
lies in the initial stages of the factorization.
For performance reasons, dense linear algebra libraries compute the LU
factorization via a blocked algorithm that casts most computations in terms
of gemm. Figure 3 (right) presents the blocked RL algorithm. For each itera-
7
tion, the algorithm processes panels of b columns, where b is the algorithmic
block size. The three operations in the loop body factorize the “current”
panel Ap =
[
A11
A21
]
, via the unblocked algorithm (LU unb, RL1); and next
update the trailing submatrix, consisting of A12 and A22, via a triangular
system solve (trsm, RL2) and a matrix multiplication (gemm, RL3), respec-
tively. In practice, the block size b is chosen so that the successive invocations
to the gemm kernel deliver high FLOPS (flops per second) rates. If b is too
small, the performance of gemm will suffer, and so will that of the LU factor-
ization. On the other hand, reducing b is appealing as this choice decreases
the number of flops that are performed in terms of the panel factorization,
an operation that can be expected to offer significantly lower throughput
(FLOPS) than gemm. (Concretely, provided n b, the cost required for all
panel factorizations is about n2b/2 flops.) Thus, there is the tension between
these two requisites.
When the target platform is a multicore processor, the conventional par-
allelization of the LU factorization simply relies on multi-threaded instances
of trsm and gemm to exploit BDP only. Compared with this, the panel fac-
torization of Ap, which lies in the critical path of the blocked RL factorization
algorithm, exhibits a reduced degree of concurrency. Thus, depending on the
selected block size b and certain hardware features of the target architecture
(number of cores, floating-point performance, memory bandwidth, etc.), this
operation may easily become a performance bottleneck; see Figure 4.
To illustrate the performance relevance of the panel factorization, Figure 5
displays a fragment of a trace corresponding to the LU factorization of a
10, 000 × 10, 000 matrix, using the blocked RL algorithm in Figure 3, with
partial pivoting and “outer” block size b = bo = 256. (All traces in this
paper were obtained using Extrae version 3.3.0 [24].) The code is linked with
multi-threaded versions of the BLIS kernels for gemm and trsm. The panel
factorization (panel) is performed via a call to the same blocked algorithm,
with “inner” block size b = bi = 32, and also extracts BDP from the same
two kernels. With this configuration, the panel factorization represents less
than 2% of the flops performed by the algorithm. However, the trace of
the first four iterations reveals that its practical cost is much higher than
could be expected. (The cost of factorizing a panel relative to the cost of
an iteration becomes even larger as the iteration progresses.) Here we note
also the significant cost of the row permutations, which are performed via
the sequential legacy code for this routine in LAPACK (laswp). However,
8
Iter. k
(reduced concurrency)
RL1 Panel factorization
RL3
RL2
Figure 4: Exploitation of BDP in the blocked RL LU parallelization. A single thread team
executes all the operations, with less active threads for RL1 due to the reduced concurrency
of this kernel. In this algorithm, RL1 stands in the critical path.
T1-T6
Figure 5: Execution trace of the first four iterations of the blocked RL LU factorization
with partial pivoting, using 6 threads, applied to a square matrix of order 10,000, with
bo = 256, bi = 32.
this second operation is embarrassingly parallel and its execution time can
be expected to decrease linearly with the number of cores.
At this point, we note that the operations inside the loop body of the
blocked algorithm in Figure 3 (right) present strict dependencies that enforce
their computation in the order RL1 ⇒ RL2 ⇒ RL3. Therefore, there seems
to be no efficient manner to formulate a TP version of the blocked algorithm
in that figure.
3.2. Static look-ahead and nested TP+BDP
A strategy to tackle the hurdle represented by the panel factorization
in a parallel execution consists in the introduction of look-ahead [25] into
the algorithm. Concretely, during each iteration of the decomposition this
technique aims to overlap the factorization of the “next” panel with the
9
Algorithm: [A] := LU la blk(A)
Determine block size b
A→
(
ATL ATR
ABL ABR
)
, ABR →
(
APBR A
R
BR
)
where ATL is 0× 0, APBR has b columns
APBR := LU unb
(
APBR
)
while n(ATL) < n(A) do(
ATL ATR
ABL ABR
)
→
 A00 A01 A02A10 A11 A12
A20 A21 A22

where A11 is b× b
Determine block size b
% Partition into panel factorization and remainder(
A12
A22
)
→
(
AP12 A
R
12
AP22 A
R
22
)
where both AP12, A
P
22 have b columns
% Panel factorization, TPF
PF1. AP12 := trilu(A11)
−1AP12
PF2. AP22 := A
P
22 −A21AP12
PF3. AP22 := LU unb
(
AP22
)
% Remainder update, TRU
RU1. AR12 := trilu(A11)
−1AR12
RU2. AR22 := A
R
22 −A21AR12
(
ATL ATR
ABL ABR
)
←
 A00 A01 A02A10 A11 A12
A20 A21 A22

endwhile
Figure 6: Blocked RL algorithm enhanced with look-ahead for the LU factorization.
update of the “current” trailing submatrix, in practice enabling a TP version
of the algorithm with two separate branches of execution, as discussed next.
Figure 6 illustrates a version of the blocked RL algorithm for the LU
factorization re-organized to expose look-ahead. The key is to partition the
trailing submatrix into two block column panels:(
A12
A22
)
→
(
AP12 A
R
12
AP22 A
R
22
)
(2)
where AP22 corresponds to the block that, in the conventional version of the
algorithm (i.e., without look-ahead,) would be factorized during the next
iteration. This effectively separates the blocks that are modified as part of
the next panel factorization from the the remainder updates, left and right
of the 2 × 2 partitioning in (2), respectively. Proceeding in this manner
creates two coarse-grain independent tasks (groups of operations in separate
branches of execution): TPF, consisting of PF1, PF2, PF3; and TRU, composed
10
TRUTeam
TPFTeam
PF3
PF2
PF1
Iter. k RU2
RU1
Figure 7: Exploitation of TP+BDP in the blocked RL LU parallelization with look-ahead.
The execution is performed by teams TPF and TRU, consisting of tpf = 3 and tru = 8
threads, respectively. In this algorithm, the operations on the (k + 1)-th panel, including
its factorization (PF3), are overlapped with the updates on the remainder of the trailing
submatrix (RU1 and RU2).
of RU1 and RU2; see Figure 6. The “decoupling” of these block panels thus
facilitate that, in a TP execution of an iteration of the loop body of the look-
ahead version, the updates on AP12, A
P
22 and the factorization of the latter
(operations on the next panel, in TPF) proceed concurrently with the updates
of AR12, A
R
22 (remainder operations, in TRU), as there are no dependencies
between TPF and TRU.
By carefully tuning the block size b and adjusting the amount of compu-
tational resources (threads) dedicated to each of the two independent tasks,
TPF and TRU, a nested TP+BDP execution of the algorithm enhanced with
this static look-ahead can partially or totally overcome the bottleneck repre-
sented by the panel factorization; see Figure 7.
Figure 8 illustrates a complete overlap of TRU with TPF attained by the
look-ahead technique. The results in that figure correspond to a fragment of
a trace obtained for the LU factorization of a 10, 000× 10, 000 matrix, using
the blocked RL algorithm in Figure 6, with partial pivoting, and outer block
size b = bo = 256. For this experiment, the t = 6 threads are partitioned into
two teams: PF with tpf = 1 thread in charge of TPF, and RU with tru = 5
threads responsible for TRU. The panel factorization (panel) is performed
via a call to the same algorithm, with b = bi = 32, and this operation proceeds
11
Figure 8: Execution trace of the first four iterations of the blocked RL LU factorization
with partial pivoting, enhanced with look-ahead, using 6 threads, applied to a square
matrix of order 10,000, with bo = 256, bi = 32.
sequentially (as PF consists of a single thread). The application of the row
permutations is distributed between all 6 cores. As argued earlier, the net
effect of the look-ahead is that the cost of the panel factorization no longer
has a practical impact on the execution time of the (first four iterations of)
the factorization algorithm, which is now basically determined by the cost of
the remaining operations.
Given a static mapping of threads to tasks, b should balance the time
spent in the two tasks as, if the operations in TPF take longer than those
in TRU, or vice-versa, the threads in charge of the less expensive part will
become idle, causing a performance degradation. This was already visible in
Figure 8, which shows that, during the first four iterations, the operations in
TPF are considerably less expensive than the updates performed as part of
the remainder TRU. The complementary case, where TPF requires longer than
TRU, is illustrated using the same configuration, for a matrix of dimension
2, 000× 2, 000, in Figure 9. Unfortunately, as the factorization proceeds, the
theoretical costs and execution times of TPF and TRU vary, making it difficult
12
T1
T2
T4
T5
T6
T3
Figure 9: Execution trace of the first four iterations of the blocked RL LU factorization
with partial pivoting, enhanced with look-ahead, using 6 threads, applied to a square
matrix of order 2,000, with bo = 256, bi = 32.
to determine the optimal value of b, which will need to be adapted during
the factorization process.
To close this section, note that there exist strict dependencies that seri-
alize the operations within each task: PF1 ⇒ PF2 ⇒ PF3 and RU1 ⇒ RU2.
Therefore, there is no further TP in the loop-body of this re-organized ver-
sion. However, the basic look-ahead mechanism of level/depth 1 described
in this subsection can be refined to accommodate further levels of TP, by
“advancing” to the current iteration the panel factorization of the following
d iterations, in a look-ahead of level/depth d. This considerably compli-
cates the code of the algorithm, but can be seamlessly achieved by a runtime
system enhanced with priorities.
4. Advocating for Malleable Thread-Level LA Libraries
For simplicity, in the following discussion we will assume that TPF and TRU
consist only of the panel factorization involving AP22 (PF3) and the update of
AR22 (RU2), respectively. Furthermore, we will consider a nested TP+BDP
13
execution using t = tpf + tru threads, initially with a team PF of tpf threads
mapped to the execution of PF3 and a team RU of tru threads computing
RU2.
Ideally, for the LU factorization with look-ahead, we would like to perform
a flexible sharing of the computational resources so that, as soon as the
threads in team PF complete PF3, they join team RU to help in the execution
of RU2 or vice-versa.We next discuss these two cases in detail.
4.1. Worker sharing: Panel factorization less expensive than update
Our goal is to enable, at each iteration of the algorithm for the LU fac-
torization with look-ahead, the threads in team PF that complete the panel
factorization join the thread team RU working on the update. The problem
is that, if the multiplication to update AR22 was initiated via an invocation to
a traditional gemm, this is not possible as none of the existing high perfor-
mance implementations of BLAS allow the number of threads working on a
kernel that is already in execution to be modified.
4.1.1. Suboptimal solution: Static re-partitioning
A simple workaround for this problem is to split AR22 into multiple column
blocks, for example, AR22 →
(
A1 A2 . . . Aq
)
, and to perform a separate
call to BLAS gemm in order to exploit BDP during the update of each
block. Then, just before each invocation, the kernel’s code queries whether
the execution of the panel factorization is completed and, if that is the case,
executes the suboperation with the threads from both teams (or only those
of RU otherwise). Unfortunately, this approach presents several drawbacks:
• Replacing a single invocation to a coarse gemm by multiple calls to
smaller gemm may offer lower throughput because the operands passed
to gemm are smaller and/or suboptimally “shaped”. The consequence
is that calling gemm multiple times will internally incur re-packing and
data movement overheads, which are more difficult to amortize because
of the smaller problem dimensions.
• The burden of which loop to partition for parallelism (note that AR22
could have alternatively been split by rows, or into blocks), and the
granularity of this partitioning is then placed upon the programmer’s
shoulders, who may lack the information that is necessary to make a
good choice. For example, if the granularity is too coarse, this will
have negative effect because the integration of the single thread in the
14
update will likely be delayed. A granularity that is too fine, on the
other hand, may reduce the parallelism within the BLAS operation or
result in the use of cache blocking parameters that are too small.
4.1.2. Our solution: Malleable thread-level BLAS
The alternative that we propose in this work exploits BDP inside RU2, but
allows to change the number of threads that participate in this computation
even if the task is already in execution! In other words, we view the threads
as a resource pool of workers that can be shared between different tasks and
reassigned to the execution of a (BLAS) kernel that is already in progress.
The key to our approach lies in the explicit exposure of the gemm in-
ternals (and other BLAS-3 kernels) in BLIS. Concretely, assume that RU2
is computed via a single invocation to BLIS gemm, and consider that this
operation is parallelized by distributing the iteration space of Loop 4 among
the threads in team PF; (see Figures 1 and 2). Then, just before Loop 4, we
force the system to check if the execution of the panel factorization is com-
pleted and, based on this information, decides whether this loop is executed
using either the union of the threads from both teams or only those in RU;
see Figure 10.
Let us re-analyze the problems listed in Subsection 4.1.1 for the work-
around that statically partitioned the update of AR22, and compare them
with our solution that implicitly embeds this partitioning inside BLIS:
• The partitioning of gemm into multiple calls to smaller matrix multi-
plications does not occur. Our solution performs a single call to gemm
only, so that there is no additional re-packing nor data movements. For
example, in the case just discussed, Bc is already packed and re-used
independently of whether t or tru threads participate in the gemm. The
buffer Ac is packed only once per iteration of Loop 3 (in parallel by both
teams or only RU).
• The decision of the best partitioning/granularity is left in the hands of
BLIS, which likely has more information to do a better job than the
programmer.
Importantly, the partitioning happens dynamically and is transparent to the
programmer.
Figure 11 validates the effect of integrating a malleable version of BLIS
into the same configuration that produced the results in Figure 8. A compar-
ison of both figures shows that, with a malleable version of BLIS, the thread
15
..
.
for =pc
jc
ic = 0
= mic c
.
.
.
.
.
.
TRUTeam
TPFTeam
RU2for =
Entry point
Iter. k
PF3
Figure 10: Exploitation of TP+BDP in the blocked RL LU parallelization with look-ahead
and WS. The execution is performed by teams TPF and TRU, consisting of tpf = 3 and tru =
8 threads, respectively. In this example, team TPF completes the factorization PF3 while
team TRU is executing the first iteration of Loop 3 that corresponds to RU2/gemm (ic = 0).
Both teams then merge and jointly continue the update of the remaining iterations of that
loop (ic = mc, 2mc, . . .). With the parallelization of gemm Loop 4, one such “entry point”
enables the merge at the beginning of each iteration of loop ic.
executing the operations in TPF, after completing this task, rapidly joins the
team that computes the remainder updates, thus avoiding the idle wait.
Compared with BLIS, the same approach cannot be integrated into Goto-
BLAS because the implementation of gemm in this library only exposes the
three outermost loops of Figure 1, while the remaining loops are encoded in
assembly. The BLAS available as part commercial libraries is not an option
either because hardware vendors offer black-box implementations which do
not permit the migration of threads.
4.2. Early termination: Panel factorization more expensive than update
The analysis of this case will reveal some important insights. In order to
discuss them, let us consider that, in the LU factorization with look-ahead,
the panel factorization (PF3) is performed via a call to the blocked routine
16
Figure 11: Execution trace of the first four iterations of the blocked RL LU factorization
with partial pivoting, enhanced with look-ahead and malleable BLIS, using 6 threads,
applied to a square matrix of order 10,000, with bo = 256, bi = 32.
in Figure 3 (right). We assume two blocking parameters: b = bo for the
outer routine that computes the LU factorization of the complete matrix
using look-ahead, and b = bi for the inner routine that factorizes each panel.
(Note that, if bi = bo or bi=1, the panel factorization is then simply done via
the unblocked algorithm.) Furthermore, we will distinguish these two levels
by referring to them as the outer LU (factorization with look-ahead) and
the inner LU (factorization via the blocked algorithm without look-ahead).
Thus, at each iteration of the outer LU, a panel of bo columns is factorized
via a call to LU blk (inner LU), and this second decomposition proceeds to
factorize the panel using a blocked algorithm with block size bi; see Figure 12.
From Figure 3 (right), the loop body for the inner LU consists of a call to
the unblocked version of the algorithm (RL1), followed by the invocations to
trsm and gemm that update A12 and A22, respectively (RL2 and RL3). Now,
let us assume that the update RU2 by the thread team RU is completed while
the threads of team PF are in the middle of the computations corresponding
to an iteration of the loop body of the inner LU. Then, provided the versions
17
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 





















































 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 





















































bo
bo
bi
biIter      :
bo bo biIter            :/
Outer LU: LU_LA_BLK Inner LU: LU_BLK
LU_UNB
Iter 1:
...
Iter k: 
...
Iter 1:
...
Iter k: 
n / Iter           :
...
Iter 1:
...
Iter k: 
...
RU2
RU1PF1
PF2
PF3
RL1 RL2
RL3
RL1: 
rl1
rl2
PF1
PF2
RU1
RU2
PF3
RL1
RL2
RL3
Figure 12: Outer vs inner LU and use of algorithmic block sizes.
of the BDP versions trsm and gemm kernels that are invoked from the inner
LU are malleable (see subsection 4.1.2), inside them the system will perform
the actions that are necessary to integrate the thread team RU, which is
now idle, into the corresponding (and subsequent) computation(s). Unfortu-
nately, the updates in the loop body of the inner LU involve small-grained
computations (A12 and A22 have at most bo − bi columns, decreasing by bi
columns at each iteration), and little parallel performance can be expected
from it especially because of partial pivoting.
In order to deal with this scenario, a different option is to force the inner
LU to stop at the end of the current iteration, to then rapidly proceed to
the next iteration of the outer LU. We refer to this strategy as the early
termination (ET). In order to do this though, the transformations computed
to factorize the current inner-panel must be propagated first to the remaining
columns outside this panel, introducing a certain delay in this version of the
ET strategy.
A third possibility is to rely on a left-looking (LL) version of the LU
18
factorization for the inner LU, as discussed next. The blocked LL algorithm
for the LU factorization differs from the blocked RL variant (see the algorithm
in the right-hand side of Figure 3) in the operations performed inside the
loop-body, which are replaced by
LL1. A01 := trilu(A00)
−1A01
LL2.
[
A11
A21
]
:=
[
A11
A21
]
−
[
A10
A20
]
A01
LL3.
[
A11
A21
]
:= LU unb
([
A11
A21
])
Thus, at the end of a certain iteration, this variant has only updated the
current column of the inner-panel and those to its left. In other words,
no transformations are propagated beyond that point (i.e., to the right of
the current column/inner-panel), and ET can be implemented in a straight-
forward manner, with no delay compared with an inner LU factorization via
the RL variant.
A definitive advantage of the LL variant compared with its RL counter-
part is that the former implements a lazy algorithm, which delays the opera-
tions towards the end of the panel factorization, while the second corresponds
to an eager algorithm that advances as much computations as possible to the
initial iterations. Therefore, in case the panel factorization has to be stopped
early, it is more likely that the LL variant has progressed in the factorization
further.3 The appealing consequence is that this enables the use of larger
block sizes for the following updates in the LL variant.
From an implementation point of view, the synchronization between the
two teams of threads is easy to handle. For example, at the beginning of each
iteration of the outer LU, a boolean flag is set to indicate that the remainder
update is incomplete. The thread team RU then changes this value as soon
as this task is complete. In the mean time, the flag is queried by the thread
team PF, at the end of every iteration of the inner LU, aborting its execution
when a change is detected. With this operation mode, there is no need to
protect the flag from race conditions. This solution also provides an adaptive
3Consider the factorization of an m × n matrix that is stopped at iteration k < n.
The LL algorithm will have performed m2k −m3/3 flops at that point while, for the RL
algorithm, the flop count raises to that of the LL algorithm plus 2(n− k)(mk − k2/2).
19
RU2
TRUTeam
TPFTeam
.
.
.
.
.
.
.
.
.
Iter. k
LL1
LL4
LL4
LL1
iterations
Skipped
Alert ET
.
.
.
PF3
Figure 13: Exploitation of TP+BDP in the blocked RL LU parallelization with look-
ahead and ET. The execution is performed by teams TPF and TRU, consisting of tpf = 3
and tru = 8 threads, respectively. In this example, team TRU completes the update RU2
while team TPF is executing an iteration of the panel factorization PF3. TRU then notifies
of this event to TPF, which then skips the remaining iterations of the loop that processes
the panel.
(automatic) configuration of the block size as, if chosen too large, it will be
adjusted for the current (and, possibly, subsequent) iterations by the early
termination of the inner LU. The process is illustrated in Figure 13.
4.3. Relation to adaptive look-ahead via a runtime
Compared with our approach, which only applies look-ahead at one level,
a TP execution that relies on a run-time for adaptive-depth look-ahead ex-
poses a higher degree of parallelism from “future iterations”, which can amor-
tize the cost of the panel factorization over a considerably larger number
of flops. This can be beneficial for architectures with a large number of
cores, but can be partially compensated by increasing the number of threads
dedicated to the panel factorization, combined with a careful fine-grain ex-
ploitation of the concurrency [26], in our approach. On the other hand,
20
adaptive-depth look-ahead via a runtime suffers from re-packing and data
movement overheads due to multiple calls to gemm. Moreover, it couples
the algorithmic block size that fixes the granularity of the tasks to that of
the suboperands in gemm. Finally, runtime-based solutions rarely exploit
nested TP+BDP parallelism and, even if they do so, taking advantage of a
malleable thread-level BLAS from within them may be difficult.
5. Experimental Evaluation
In this section we analyze in detail the performance behavior of several
multi-threaded implementations of the algorithms for the LU factorization:
• LU: Blocked RL (Figure 3). This code only exploits BDP, via calls to
the (non-malleable) multi-threaded BLIS (version 0.1.8).
• Variants enhanced with look-ahead (Figure 6). The following three
implementations take advantage of nested TP+BDP, with 1 thread
dedicated to the operations on the panel (team TPF ) and t − 1 to the
remainder updates (team TRU).
– LU LA (subsection 3.2): Blocked RL with look-ahead.
– LU MB (subsection 4.1.2): Blocked RL with look-ahead and mal-
leable BLIS.
– LU ET (subsection 4.2): Blocked RL with look-ahead, malleable
BLIS, and early termination of the panel factorization.
• LU OS: Blocked RL with adaptive look-ahead extracted via the OmpSs
runtime (version 16.06). LU OS decomposes the factorization into a
large collection of tasks connected via data dependencies, and then
exploits TP only, via calls to a sequential instance of BLIS. In more
detail, the OmpSs parallel version divides the matrix into a collection
of panels of fixed width bo. All operations performed during an itera-
tion of the algorithm on the same panel (row permutation, triangular
system solve, matrix multiplication and, possibly, panel factorization)
are then part of the same task. This implementation includes priorities
to advance the schedule of tasks involving panel factorizations.
All codes include standard partial pivoting and compute the same factor-
ization. Also, all solutions perform the panel factorization via the blocked
21
RL algorithm, except for LU ET and LU OS, which employ the blocked LL
variant. The performance differences between the LL and RL variants, when
applied solely to the panel factorization, were small. Nonetheless, for LU ET,
employing the LL variant improves the ET mechanism and unleashes a faster
execution of the global factorization. For LU OS we integrated the LL vari-
ant as well to favor a fair comparison between this implementation and our
LU ET. The block size is fixed to bo during the complete iteration in all cases,
except for LU ET which initially employs bo, but then adjusts this value dur-
ing the factorization as part of the ET mechanism.
In the experiments, we considered the factorization of square matrices,
with random entries uniformly distributed in (0, 1), and dimension n=500 to
12,000 in steps of 500. The block size for the outer LU was tested for values
bo=32 to 512 in steps of 32. The block size for the inner LU was evaluated for
bi=16 and 32. We employed one thread per core (i.e., t = 6) in all executions.
5.1. Optimal block size
The performance of the blocked LU algorithms is strongly influenced by
the outer block size bo. As discussed in subsection 3.1, this parameter should
balance two criteria:
• Deliver high performance for the gemm kernel. Concretely, in the algo-
rithms in Figures 3 and 6, a value of bo that is too small turns A21 and
A12/A
R
12 into narrow column and row panels respectively, transforming
the matrix multiplication involving these blocks (RL3 in Figure 3 and
RU2 in Figure 6) into a memory-bound kernel that will generally deliver
low performance. In the following, we will refer to a gemm (1) with
dimensions m ≈ n  k, as a panel-panel multiplication (gepp) [8].
Note that, for the gepp arising in the LU factorizations, k = bo.
• Reduce the amount of operations performed in the panel factorization
(about n2bo/2 flops, provided n  bo), in order to avoid the negative
impact of this intrinsically sequential stage.
Figure 14 sheds further light on the roles played by these two factors. The
plot in the left-hand side reports the performance of gepp, in terms of
GFLOPS (billions of FLOPS), showing that the implementation of this kernel
22
 0
 10
 20
 30
 40
 50
 60
 70
 80
 90
 100
 110
 120
 130
 0  64  128  192  256  320  384  448  512
G
FL
O
PS
Problem dimension k
GEPP on Intel Xeon E5-2603 v3 
m=n=10000
m=n=  8000
m=n=  6000
m=n=  4000
m=n=  2000
 0
 0.02
 0.04
 0.06
 0.08
 0.1
 0.12
 0.14
 0.16
 0.18
 0.2
 0  64  128  192  256  320  384  448  512
Pe
rc
en
ta
ge
 o
f f
lo
ps
Block size
Ratio of panel factorizations to total cost
n=10000
n=  8000
n=  6000
n=  4000
n=  2000
Figure 14: GFLOPS attained with gepp (left) and ratio of flops performed in the panel
factorizations normalized to the total cost (right).
in BLIS achieves an asymptotic performance peak for k(= bo) around 144.
4
The right-hand side plot reports the ratio of flops performed in the panel
factorizations with respect to those of the LU factorization.
The combined effect of these criteria seems to point in the direction of
choosing the smallest bo that attains the asymptotic GFLOPS rate for gepp.
However, Figure 15 illustrates the experimental optimal block size bo for the
distinct LU factorization algorithms, exposing that this is not the case. We
next discuss the behavior for LU, LU LA and LU MB, which show different
trends. (LU ET and LU OS are analyzed latter.) In particular, LU benefits
from the use of larger values of bo than the other two codes for all problem
dimensions. The reason is that a large block size operates on wide panels,
which turns their factorization into a BLAS-3 operation with a mild degree of
parallelism, and reduces the impact of this computation on the critical path
of the factorization. LU LA exhibits a similar behavior for large problems,
but favors smaller block sizes for small to moderate problems. The reason is
that, for LU LA, it is important to balance the panel factorization (TPF) and
remainder update (TRU) so that their execution approximately requires the
same time.
Compared with the previous two implementations, LU MB promotes the
use of small block sizes, up to bo=192, for the largest problems. (Interest-
ingly, this corresponds to the optimal value of k for gepp.) One reason
4The performance drop observed for k slightly above 256 is due to the optimal value
of kc being equal to that number in this architecture.
23
for this behavior is that, when the malleable version of BLIS is integrated
into LU MB, the practical costs of the two branches/tasks do not need to
be balanced. Let us elaborate this case further, by considering the effect of
reducing the block size, for example, from bo to b
′
o = bo/2. For simplicity,
in the following discussion we will use approximations for the block dimen-
sions and their costs; furthermore, we will assume that n  bo. The first
and most straight-forward consequence of halving the block size is that the
number of iterations is doubled. Inside each iteration with the original block
size bo, the loop body invokes, among others kernels, a gepp of dimensions
m × (m − bo) × bo (with m the number of rows in the trailing submatrix
AR22), for a cost of 2m
2bo flops; in parallel, the factorization involves a panel
of dimension m × bo, for a cost of mb2o − b3o/3 ≈ mb2o flops. When the block
size is halved to b′o, the same work is basically computed in two consecutive
iterations. However, this reduces the amount of flops performed in terms of
panel factorizations to about 2m(b′o)
2 = mb2o/2 while it has a minor impact
on the number of flops that are cast as gepp (two of these products, at a
cost of 2m2b′o = 2m
2bo/2 flops each). The conclusion is that, by reducing
the block size, we decrease the time that the single thread spends in the
panel factorization TPF, favoring its rapid merge with the thread team that
performs the remainder update TRU. Thus, in case the execution time of the
LU is dominated by TRU, adding one more thread to perform this task (in
this scenario, in the critical path) as soon as possible will reduce the global
execution time of the algorithm.
5.2. Performance comparison of the variants with static look-ahead
The previous analysis on the effect of the block size exposes that choosing
the optimal block size is a difficult task. Either we need a model that can
accurately predict the performance of each building block appearing in the
LU factorization, or we perform an extensive experimental analysis to select
the best value. The problem is even more complex if we consider that,
in practice, an optimal selection would have to vary the block size as the
factorization progresses. Concretely, for the factorization of a square matrix
of order n via a blocked algorithm, the problem is decomposed into multiple
subproblems that involve the factorization of matrices of orders n−bo, n−2·bo,
n − 3 · bo, etc. From Figure 15, it is clear that the optimal value of bo will
be different for several of these subproblems. In the end, the value that we
show in Figure 15 for each problem has to be considered as a compromise
24
 0
 32
 64
 96
 128
 160
 192
 224
 256
 288
 320
 352
 384
 416
 448
 480
 512
 0  2000  4000  6000  8000  10000  12000
O
pt
im
al
 b
lo
ck
 s
ize
Problem dimension n
LU on Intel Xeon E5-2603 v3 
LU_OS
LU_ET
LU_MB
LU_LA
LU
Figure 15: Optimal block size of the blocked RL algorithms for the LU factorization.
 0
 10
 20
 30
 40
 50
 60
 70
 80
 90
 100
 110
 0  2000  4000  6000  8000  10000  12000
G
FL
O
PS
Problem dimension n
LU on Intel Xeon E5-2603 v3 
LU_ET
LU_MB
LU_LA
LU
Figure 16: Performance comparison of the blocked RL algorithms for the LU factorization
(except LU OS) with a fixed block size bo = 256.
that attains fair performance for a wide range of the subproblems appearing
in that case.
Figure 16 reports the GFLOPS rates attained by the distinct implemen-
tations to compute the plain LU factorization and the variants equipped with
static look-ahead (i.e., all except LU OS), using bo = 256 as a compromise
25
value for all of them. Although this value is optimal for only a few cases, the
purpose of this experiment is to show the improvements attained by grad-
ually introducing the techniques enhancing look-ahead. The figure reveals
some relevant trends:
• Except for the smallest problems, integrating the look-ahead techniques
clearly improves the performance of the plain LU factorization imple-
mented in LU.
• The version with malleable BLAS (LU MB) improves the performance
of the basic version of look-ahead (LU LA) for the larger problems.
This is a consequence of the cost of the panel factorization relative
to that of the global factorization. Concretely, for fixed bo, as the
problem size grows, the global flop-cost varies cubically in n, as 2n3/3,
while the flop-cost of the panel factorizations grows quadratically, with
n2bo/2. Thus, we can expect that, for large n, the remainder update
TRU becomes more expensive than the panel factorization TPF. This
represents the actual scenario that was targeted by the variant with
malleable BLIS.
• The version that combines the malleable BLAS with ET (LU ET) de-
livers the same performance of LU MB for large problems, but outper-
forms all other variants with static look-ahead for the smaller problems.
Again, this could be expected by considering the relative cost of the
panel factorization for small n.
5.3. Performance comparison with OmpSs
We conclude the experimental analysis by providing a comparison of the
best variant with static look-ahead, LU ET, with the implementation that
extracts parallelism via the OmpSs runtime, LU OS. In this last experiment
we depart from the previous case, performing an extensive evaluation in order
report the performance for the optimal block size for each problem dimension
and algorithm. (See Figure 15 for the actual optimal values employed in
the experiment.) For LU OS, we select a value for bo that is then fixed for
the complete factorization. As this variant overlaps the execution of tasks
from different iterations in time, it is difficult to vary the block size as the
factorization progresses. For LU ET, the selected value of bo only applies to
the first factorization. After that, the ET mechanism automatically adjusts
this value during the iteration.
26
 0
 10
 20
 30
 40
 50
 60
 70
 80
 90
 100
 110
 0  2000  4000  6000  8000  10000  12000
G
FL
O
PS
Problem dimension n
LU on Intel Xeon E5-2603 v3 
LU_ET (b_opt)
LU_OS (b_opt)
LU_ET (b=192)
LU_OS (b=256)
Figure 17: Performance comparison between the OmpSs implementation and the blocked
RL algorithm for the LU factorization with look-ahead, malleable BLIS and ET. Two
configurations are chosen for each algorithm: optimal block size for each problem size;
and fixed block sizes bo = 192 for LU ET and bo = 256 for LU OS.
Figure 17 shows the results for this comparison in the lines labelled as
“(b opt)”. LU ET is very competitive, clearly outperforming the runtime-
based solution for most problems and offering competitive performance for
the largest three.
Manually tuning the block size to each problem dimension is in general
impractical. For this reason, the figure also shows the performance curves
when the block size is fixed to bo = 192 for LU ET and bo = 256 for LU OS.
These values were selected because they offered high performance for a wide
range of problem dimensions (especially, the largest ones; see Figure 15).
Interestingly, the performance lines corresponding to this configuration, la-
belled with “(b=192)”/“(b=256)”, show that choosing a suboptimal value for
bo has a minor impact on the performance of our solution LU ET, because the
ET mechanism adjust this value on-the-fly (for the smaller problem sizes).
Compared with this, the negative effect of a suboptimal selection on LU OS
is clearly more visible.
A comparison with other parallel versions of the LU factorization with
partial pivoting is possible, but we do not expect the results change the mes-
sage of our paper. In particular, Intel MKL includes a highly-tuned routine
for this factorization that relies in their own implementation of the BLAS
27
and some type of look-ahead. Therefore, whether the advantages of one im-
plementation over the other come simply from the use of a different version
of the BLAS, or from the positive effects of our WS and ET mechanism, will
be really difficult to infer. The PLASMA library [17] also provides a routine
for the LU factorization with partial pivoting supported by a runtime that
implements dynamic look-ahead. The techniques integrated in PLASMA’s
routine are not different from those in the OmpSs implementation evaluated
in our paper. Therefore, when linked with BLIS, we do not expect a different
behaviour between PLASMA’s routine and LU OS.
6. Concluding Remarks and Future Work
We have introduced WS and ET as two novel techniques to avoid work-
load imbalance during the execution of matrix factorizations, enhanced with
look-ahead, for the solution of linear systems. The WS mechanism especially
benefits from the adoption of a malleable thread-level instance of BLIS, which
allows the thread team in charge of the panel factorization, upon completion
of this task, to be reallocated to the execution of the trailing update. The ET
mechanism tackles the opposite situation, with a panel factorization that is
costlier than the trailing update. In such scenario, the team that performed
the update communicates to the second team that it should terminate the
panel factorization, advancing the factorization process into the next itera-
tion.
Our results on an Intel Xeon E5-2603 v3 show the performance benefits
of our version enhanced with malleable BLIS and ET compared with a plain
LU factorization as well as a version with look-ahead. The experiments also
report competitive performance compared with an LU factorization that is
parallelized by means of a sophisticated runtime, such as OmpSs, that intro-
duces look-ahead of dynamic (variable) depth. Compared with the OmpSs
solution, our approach offers higher performance for most problem dimen-
sions, seamlessly tunes the algorithmic block size, and features a considerably
smaller memory footprint as it does not require a sophisticated runtime sup-
port.
To conclude, our paper does not intend to propose an alternative to
runtime-based solutions. Instead, the message implicitly carried in our ex-
periments aims to emphasize the benefits of malleable thread-level libraries,
which we expect to be crucial in order to exploit the massive thread par-
allelism of future architectures. This work opens a plethora of interesting
28
questions for future research. In particular, how to generalize the ideas to a
multi-task scenario, what kind of interfaces may ease thread-level malleabil-
ity, and what kind of support is necessary in the runtime for this purpose.
Acknowledgements
We thank the other members of the FLAME team for their support. This
research was partially sponsored by projects TIN2014-53495-R and TIN2015-
65316-P of the Spanish Ministerio de Economı´a y Competitividad, the H2020
EU FETHPC Project 671602 “INTERTWinE”, by project 2014-SGR-1051
from the Generalitat de Catalunya, and NSF grant ACI-1550493.
Any opinions, findings and conclusions or recommendations expressed in
this material are those of the author(s) and do not necessarily reflect the
views of the National Science Foundation (NSF).
References
[1] C. L. Lawson, R. J. Hanson, D. R. Kincaid, F. T. Krogh, Basic linear
algebra subprograms for Fortran usage, ACM Trans. Math. Soft. 5 (3)
(1979) 308–323.
[2] J. J. Dongarra, J. Du Croz, S. Hammarling, R. J. Hanson, An extended
set of FORTRAN basic linear algebra subprograms, ACM Trans. Math.
Softw. 14 (1) (1988) 1–17.
[3] J. J. Dongarra, J. Du Croz, S. Hammarling, I. Duff, A set of level 3 basic
linear algebra subprograms, ACM Trans. Math. Softw. 16 (1) (1990) 1–
17.
[4] Intel, Math Kernel Library, https://software.intel.com/en-us/
intel-mkl (2015).
[5] AMD, AMD Core Math Library, http://developer.amd.com/
tools-and-sdks/cpu-development/amd-core-math-library-acml/
(2015).
[6] IBM, Engineering and Scientific Subroutine Library, http://www-03.
ibm.com/systems/power/software/essl/ (2015).
[7] NVIDIA, cublas, https://developer.nvidia.com/cublas (2016).
29
[8] K. Goto, R. A. v. d. Geijn, Anatomy of high-performance matrix
multiplication, ACM Trans. Math. Softw. 34 (3) (2008) 12:1–12:25.
doi:10.1145/1356052.1356053.
URL http://doi.acm.org/10.1145/1356052.1356053
[9] K. Goto, R. van de Geijn, High performance implementation of the level-
3 BLAS, ACM Transactions on Mathematical Software 35 (1) (2008)
4:1–4:14.
URL http://doi.acm.org/10.1145/1377603.1377607
[10] http://www.openblas.net (2015).
[11] R. C. Whaley, J. J. Dongarra, Automatically tuned linear algebra soft-
ware, in: Proceedings of SC’98, 1998.
[12] F. G. Van Zee, R. A. van de Geijn, BLIS: A framework for rapidly in-
stantiating BLAS functionality, ACM Trans. Math. Softw. 41 (3) (2015)
14:1–14:33.
[13] E. Anderson, Z. Bai, L. S. Blackford, J. Demmesl, J. J. Dongarra, J. D.
Croz, S. Hammarling, A. Greenbaum, A. McKenney, D. C. Sorensen,
LAPACK Users’ guide, 3rd Edition, SIAM, 1999.
[14] F. G. V. Zee, libflame: The Complete Reference, www.lulu.com, 2009.
[15] OmpSs project home page, http://pm.bsc.es/ompss.
[16] StarPU project, http://runtime.bordeaux.inria.fr/StarPU/.
[17] PLASMA project home page, http://icl.cs.utk.edu/plasma.
[18] FLAME project home page, http://www.cs.utexas.edu/users/
flame/.
[19] G. H. Golub, C. F. V. Loan, Matrix Computations, 3rd Edition, The
Johns Hopkins University Press, Baltimore, 1996.
[20] F. G. V. Zee, T. M. Smith, B. Marker, T. M. Low, R. A. V. D.
Geijn, F. D. Igual, M. Smelyanskiy, X. Zhang, M. Kistler, V. Aus-
tel, J. A. Gunnels, L. Killough, The BLIS framework: Experiments
in portability, ACM Trans. Math. Softw. 42 (2) (2016) 12:1–12:19.
doi:10.1145/2755561.
URL http://doi.acm.org/10.1145/2755561
30
[21] T. M. Smith, R. van de Geijn, M. Smelyanskiy, J. R. Hammond, F. G.
Van Zee, Anatomy of high-performance many-threaded matrix multi-
plication, in: Proc. IEEE 28th Int. Parallel and Distributed Processing
Symp., IPDPS’14, 2014, pp. 1049–1059.
[22] S. Catala´n, F. D. Igual, R. Mayo, R. Rodr´ıguez-Sa´nchez, E. S. Quintana-
Ort´ı, Architecture-aware configuration and scheduling of matrix multi-
plication on asymmetric multicore processors, Cluster Computing 19 (3)
(2016) 1037–1051. doi:10.1007/s10586-016-0611-8.
URL http://dx.doi.org/10.1007/s10586-016-0611-8
[23] J. A. Gunnels, F. G. Gustavson, G. M. Henry, R. A. van de Geijn,
FLAME: Formal linear algebra methods environment, ACM Trans.
Math. Softw. 27 (4) (2001) 422–455.
[24] B. S. Center, Extrae user guide manual for version 2.5.1 (2016).
[25] P. Strazdins, A comparison of lookahead and algorithmic blocking tech-
niques for parallel matrix factorization, Tech. Rep. TR-CS-98-07, De-
partment of Computer Science, The Australian National University,
Canberra 0200 ACT, Australia (1998).
[26] A. M. Castaldo, R. C. Whaley, S. Samuel, Scaling LAPACK panel oper-
ations using parallel cache assignment, ACM Trans. Math. Soft. 39 (4)
(2013) 22:1–22:30.
31
