Supporting mixed-datatype matrix multiplication within the BLIS
  framework by Van Zee, Field G. et al.
Supporting mixed-datatype matrix multiplication within the BLIS
framework
FLAME Working Note #89
Field G. Van Zeea,b, Devangi N. Parikha, Robert A. van de Geijna,b
aInstitute for Computational Engineering and Sciences
bDepartment of Computer Science
The University of Texas at Austin, Austin, TX
{field,dnp,rvdg}@cs.utexas.edu
May 3, 2019
Abstract
We approach the problem of implementing mixed-datatype support within the general matrix multiplication
(gemm) operation of the BLIS framework, whereby each matrix operand A, B, and C may be stored as single- or
double-precision real or complex values. Another factor of complexity, whereby the computation is allowed to take
place in a precision different from the storage precisions of either A or B, is also included in the discussion. We first
break the problem into mostly orthogonal dimensions, considering the mixing of domains separately from mixing
precisions. Support for all combinations of matrix operands stored in either the real or complex domain is mapped
out by enumerating the cases and describing an implementation approach for each. Supporting all combinations
of storage and computation precisions is handled by typecasting the matrices at key stages of the computation—
during packing and/or accumulation, as needed. Several optional optimizations are also documented. Performance
results gathered on a 56-core Marvell ThunderX2 and a 52-core Intel Xeon Platinum demonstrate that high
performance is mostly preserved, with modest slowdowns incurred from unavoidable typecast instructions. The
mixed-datatype implementation confirms that combinatoric intractability is avoided, with the framework relying
on only two assembly microkernels to implement 128 datatype combinations.
1 Introduction
The BLAS [8] defines the general matrix-matrix multiplication (gemm) operation to support any of the following
computations:
C := αAB + βC, C := αABT + βC, C := αABH + βC,
C := αATB + βC, C := αATBT + βC, C := αATBH + βC,
C := αAHB + βC, C := αAHBT + βC, C := αAHBH + βC.
where C is m × n, the left-hand matrix product operand (A, AT , or AH) is m × k, the right-hand matrix product
operand (B, BT , or BH) is k × n, and α and β are scalars.
This matrix multiplication functionality is made available to software developers via the following application
programming interfaces, or APIs:
sgemm( transa, transb, m, n, k, alpha, A, ldA, B, ldB, beta, C, ldC )
dgemm( transa, transb, m, n, k, alpha, A, ldA, B, ldB, beta, C, ldC )
cgemm( transa, transb, m, n, k, alpha, A, ldA, B, ldB, beta, C, ldC )
zgemm( transa, transb, m, n, k, alpha, A, ldA, B, ldB, beta, C, ldC )
1
ar
X
iv
:1
90
1.
06
01
5v
2 
 [c
s.M
S]
  2
 M
ay
 20
19
The first letter of the routine name uniquely encodes the datatype—that is, the domain and precision—of the
matrix and scalar operands as well as the computation: single-precision real (s); double-precision real (d); single-
precision complex (c); and double-precision complex (z). The parameters transa and transb indicate if A and/or B,
respectively, should be computed upon as if they were transposed or conjugate-transposed. The interfaces implicitly
require that matrices be stored in column-major order. Accordingly, the parameters ldA, ldB, and ldC convey the
so-called “leading dimensions” of the arrays A, B, and C, respectively—that is, the number of elements that separate
matrix element (i, j) from element (i, j + 1) in memory.
While this interface has served the HPC community well, it has also become constraining. For example, when
computing tensor contractions (which often resemble matrix multiplications), one may need to refer to a sub-tensor
that cannot be represented with column-major storage without making a temporary copy. Similarly, some situations
may call for conjugating (but not transposing) a matrix operand. Indeed, such functionality is already supported
by the BLIS framework, which exports BLAS-like operations and APIs [28]. However, even BLIS only supports
computation on operands with identical datatypes. Consider the following:
• There exist applications that may wish to update a complex matrix by the product of a complex matrix and
a real matrix. These include applications involving damped response [17, 6], Green’s functions methods [20],
Complex Absorbing Potential (CAP), and Complex Scaling (CS) [15]. Applications in quantum chemistry may
also benefit. These mixed-domain instances of gemm are currently improvised either by casting the operation
in terms of cgemm or zgemm, in which case half of the floating-point operations are superfluous, or by performing
two passes with sgemm or dgemm, which tends to be cumbersome and error-prone, requires extra workspace in
which to make temporary copies of the real and imaginary parts of the complex matrix operands, and likely
yields suboptimal performance.
• Similarly, there exist applications that could benefit from storing matrix operands in different precisions,
and/or computing in a precision that is lower or higher than the storage precision of A and B. These include
NWChem [25, 1] performing CCSD(T) computations [7, 24], and various applications in machine learning [16].
Currently, this must be performed in an ad-hoc manner similar to the mixed-domain case, and with similar
workspace and performance drawbacks.
Thus, there is likely a fair amount of pent-up demand for high-performance implementations to datatype-flexible
BLAS-like APIs.
As alluded to above, the naive approach to supporting mixed-datatype functionality within the gemm operation
comes with obvious memory, performance, and productivity drawbacks: typecasting matrix operands to a common
domain and/or precision outside of the original implementation requires considerable workspace; the memory access
patterns engendered by monolithic casting almost surely acts as a drag on performance; and programming an ad-hoc
solution in terms of the BLAS gemm interfaces sometimes requires non-trivial skills. Indeed, some would find pro-
viding the full combinatoric space of functionality daunting, and in response might attempt to survey the community
and then only implement those cases for which interest was expressed. Instead, our goal from the outset is to imple-
ment support for all cases, and to do so in a manner that delivers high or near-high performance.1 Our approach,
which builds on the BLIS framework, yields a comprehensive reference implementation with which consumers of this
functionality can explore the benefits of mixed-domain and mixed-precision gemm computation without being con-
strained by limitations in the interface, incomplete coverage within the implementation, or unnecessarily inefficient
performance.
1.1 Notation
Our notation should be mostly self-evident to most readers of high-performance dense linear algebra literature. We
use uppercase Roman letters, such as A, B, and C to denote matrices, and lowercase Greek letters α and β to
represent scalars.
1 Understandably, some readers may question the utility of some mixed-datatype cases discussed in this article. Skeptics may argue
that one only needs to focus on the cases that “make sense.” We reason about the issue as follows. While we can identify certain cases
that are today useful to some people or applications, we cannot say with certainty which cases will never be used by any person or
application. And because we cannot a priori identify the cases that will never be needed, we take the position that we must treat all
cases as important enough to merit implementation.
2
4th loop around micro-kernel 
5th loop around micro-kernel 
3rd loop around micro-kernel 
mR 
mR 
1 
+= 
+= 
+= 
+= 
+= 
+= 
nC nC 
kC 
kC 
mC mC 
1 
nR 
kC 
nR 
Pack Ai → Ai 
~ 
Pack Bp → Bp 
~ 
nR 
A Bj Cj 
Ap 
Ai 
Bp 
Cj 
Ai 
~ 
Bp 
~ 
Bp 
~ 
Ci 
Ci 
kC 
L3 cache 
L2 cache 
L1 cache 
registers 
main memory 
1st loop around micro-kernel 
2nd loop around micro-kernel 
micro-kernel 
Figure 1: BLIS refactoring of the GotoBLAS algorithm as five loops around the microkernel [27]. Used with
permission.
Real and complex domains are indicated by R and C, respectively. Occasionally, we refer to the real part of a
matrix or matrix expression X withRe(X) and to the imaginary part with Im(X). In other places, such as where this
notation would be too cumbersome, we use superscripts, such as χr and χi for the real and imaginary components,
respectively, of a scalar χ.
When representing elements within a matrix, we use a subscript to encode the row and column indices. For
example, a scalar α13 would reference the element located in the 2nd row and 4th column of a matrix A.
2
3
2 Background
In this section, we review the approach to matrix multiplication taken within the BLIS framework as well as some
related implementation details that will provide important context to discussions later in this article.
2.1 Matrix Multiplication in BLIS
The GotoBLAS algorithm [9] for performing matrix multiplication underlies the highest-performing BLAS libraries
for current general-purpose microprocessors. The BLAS-like Library Instantiation Software (BLIS) framework, which
implements gemm and other matrix-matrix operations, refactors the GotoBLAS algorithm as pictured in Figure 1.
BLIS isolates the code that needs to be optimized (in assembly code or with vector intrinsics) for different architectures
in a microkernel that updates a very small submatrix, or microtile, of C with a sequence of rank-1 updates that are
accumulated in registers [28]. All other loops and supporting kernels are implemented portably in C99. By contrast,
Goto’s implementation—also adopted by the OpenBLAS fork [21] of the GotoBLAS library—casts the computation
into a larger assembly-coded kernel. This larger unit of code, which corresponds to what BLIS refers to as the
macrokernel, consists of the microkernel plus the logic that falls within the first two loops around the microkernel.
A key element of the GotoBLAS algorithm is that high-performance implementation of gemm incorporates the
packing of submatrices of B (into buffer B˜) and of A (into buffer A˜) to improve data locality during the execution of
the microkernel.3 This feature has been used in the past to consolidate other functionality into the same framework:
implementation of other matrix-matrix operations (level-3 BLAS) [10, 28], fusing sequences of matrix operations of
importance to Machine Learning [30], and implementation of practical Fast Matrix Multiplication (Strassen-like)
algorithms [13, 14]. A final insight comes from Van Zee [26], in which it is shown how complex matrix multiplication
can be cast in terms of only microkernels designed for real domain gemm without a significant performance penalty.
Given a target architecture, instantiating the traditional functionality of gemm with BLIS requires only two
microkernels, one each for single- and double-precision real domain computations, with the insight from [26] inducing
the functionality typically provided by complex domain microkernels. It also requires packing functions, which
by default take the form of architecture-agnostic (C99) implementations provided by the framework, as well as
architecture-specific cache and register blocking parameters. BLIS exposes several other configure-time options,
though they all default to values that typically need no further tweaking.
2.2 Managing complexity in BLIS
The combinatoric complexity of the gemm operation in BLIS is mitigated in several ways.
2.2.1 Storage
BLIS tracks separate row and column strides for each matrix object.4 Using these two stride parameters, BLIS
supports three matrix storage formats in its end-user APIs: row-major storage (where the column stride is unit);
column-major storage (where the row stride is unit); and general storage (where neither stride is unit).5 Each gemm
operand in BLIS may individually be stored in any of the three aforementioned storage formats. If we consider
all possible variations of general storage as a single format, this results in a total of 33 = 27 different storage
combinations. The gemm operation supports these 27 storage combinations as follows: the packing function is
written generically to allow it to read from any of the three storage formats when reading matrices A and B (during
the packing of A˜ and B˜), and the microkernel is required to handle input/output of C in any of the three supported
formats.
2 Our subscript notation starts counting from 0.
3 This reorganization of matrices A and B can also improve TLB performance [11].
4 In BLIS, the row stride—like the leading dimension in row-major storage—expresses the number of elements that separate matrix
element (i, j) and element (i+ 1, j) in memory. The column stride—like the leading dimension in column-major storage—expresses the
number of elements that separate elements (i, j) and (i, j + 1).
5 We often refer to row-major matrices as being row-stored, and column-major matrices as being column-stored.
4
2.2.2 Transposition
BLIS easily accommodates transposition of matrices A or B by swapping the row and column strides (and their
corresponding dimensions) just prior to packing. This technique merely affects how the matrices are traversed rather
than how they are stored; thus, we call this an “induced” (or logical) transposition, as no matrix elements are actually
copied or moved.
2.2.3 Conjugation
In the case of complex matrices, optional conjugation6 of A and B is handled during the packing into A˜ and B˜.
2.2.4 Multithreaded parallelization
The authors of [22] discuss how BLIS exposes many loops, each of which can be parallelized. Crucially, the com-
plexity over matrix storage datatypes (domain and precision), transposition/conjugation parameters (transA and
transb), and matrix storage formats (row, column, generalized) are orthogonal to the issues pertaining to extracting
multithreaded parallelism from gemm. Therefore, the insights of [22] carry over to the parallelization when mixing
domains and/or precisions.
2.2.5 Optimizing input/output on C
High-performance microkernels accumulate their intermediate results using vector registers. Thus, the microkernel
author must decide whether to semantically assign the vector registers to contain contiguous rows or contiguous
columns of the microtile submatrix. We refer to this as the microkernel’s register orientation. Interestingly, the
register orientation necessarily biases the microkernel towards loading and storing elements of C as either rows or
columns, since performing IO on elements in the opposite orientation would require a sequence of costly permutation
instructions. BLIS tracks this intrinsic property, or IO preference, of the microkernel so that the framework can
transform the matrix problem to the microkernel’s liking. For example, if our microkernel is row-preferential and
the gemm implementation is executed on a column-stored matrix C, BLIS will employ a high-level transposition of
the entire operation (to CT := BTAT ) so that, from the perspective of the microkernel, CT appears to be stored in
its preferred format—that is, a manner consistent with its vector register orientation.7 Thus, regardless of whether
the microkernel is row- or column-preferential, the gemm implementation will, generally-speaking, yield similar
performance on row- and column-stored matrices C.8
2.3 For the busy reader
We acknowledge that some readers may wish for a very short synopsis of the insights presented later in this article.
We sum up the techniques that allow implementing all cases of mixed-domain and mixed-precision gemm within the
BLIS framework as follows: The mixing of domains can be handled by enumerating the eight cases (six of them new),
which largely reduce to either manipulating matrix metadata, and/or exposing the real and imaginary elements in
a complex matrix in such a way that a real matrix multiplication may be performed to induce the desired result, in
part motivated by insights from the 1m method [26]. The mixing of precisions can be handled by typecasting between
precisions, as needed, during the packing A and B (into A˜ and B˜), while the typecasting during the accumulation
of the intermediate product AB may occur in special code wrappers that update C appropriately.
6 In BLIS, all input matrix operands to gemm and most other operations may be conjugated without transposition, which corresponds
to a new trans parameter value absent in the BLAS.
7 If, for whatever reason, this optimization is not employed, the microkernel in this example would use the general storage case to read
and write to a column-stored C, which would incur a small performance penalty due to an increased number of assembly instructions.
8 Typically, the general-storage case in the microkernel must be handled separately from the contiguous row- or column-storage case
that is preferred. This case usually incurs a performance penalty that is mostly unavoidable due to decreased spatial locality and an
increased number of assembly instructions.
5
3 API Considerations
In the spirit of the BLAS API, a more complete interface that supports this richer, mixed-datatype environment
could take the form of
gemm( m, n, k, alpha, transA, A, domainA, precA, rstrideA, cstrideA,
transB, B, domainB, precB, rstrideB, cstrideB,
beta, C, domainC, precC, rstrideC, cstrideC,
precAB )
Figure 2: A hypothetical BLAS-like API for mixed-domain, mixed-precision gemm.
where transX indicates whether X should be computed upon as if it were (optionally) transposed, conjugate-
transposed, or conjugated without transposition; X is the address where matrix X is stored; domainX indicates
whether X is stored as a matrix of real or complex elements; precX indicates the precision in which elements of
X are stored (e.g., half-, single-, double-, or quad-precision); rstrideX and cstrideX indicate the row and column
strides, respectively, for storing9 X; and precAB indicates the precision in which the matrix multiplication takes place
(possibly implying promotion or demotion from the storage precision of either A or B). Note that column-major
order and row-major order are the special cases where rstrideX is unit and cstrideX is unit, respectively.
The hypothetical API shown in Figure 2 plainly exposes all of the dimensions of functionality along which the
library developer must provide implementations. However, we feel that such an interface is not very useful towards
inspiring those implementations, as it subtly nudges the developer towards extending the use of separate interfaces
for each parameter combination down the function stack, leading to solutions that are vertically siloed from each
other, even if various subsets share many similarities.
BLIS preempted this problem by starting with an object-based foundation for encoding and expressing matrix
(and vector) operands. Each linear algebra entity (such as a matrix or vector) is encapsulated within a data structure,
or more specifically, a custom struct type. For example, BLIS currently exports the following object-based function
prototype for invoking the gemm operation:
void bli_gemm( obj_t* alpha, obj_t* a, obj_t* b, obj_t* beta, obj_t* c );
Figure 3: Function for computing gemm provided by the object-based API in the BLIS framework.
The function bli gemm() takes five arguments of type obj t*, each of which corresponds to the address of a struct
representing the floating-point operands traditionally passed into the gemm operation. The function exposes no
other arguments, because all of the conventional parameters (such as transA and transB) may be interpreted as
properties of one of the floating-point operands.
A simplified version of the obj t type definition may be given as:
Here, dim t, inc t, doff t, siz t, and objbits t represent various integer types defined within BLIS for representing
dimensions, strides and increments, diagonal offsets, byte sizes, and object property bitfields, respectively. The
idea behind the struct example in Figure 4 is that matrices may be represented by a collection of properties, or
metadata, and that these properties may be set—for example, when the object is initialized and its underlying data
buffer is allocated—and then subsequently queried or modified using a collection of object-based accessor functions.
Encapsulating matrix properties within objects helps hide details that need not be exposed at certain levels of the
implementation.
The key observation to make now is that the domainX and precX arguments shown in Figure 2 can be completely
hidden within the object API of BLIS. Indeed, the current definition of obj t within the framework already includes
domain and precision bits within the info bitfield. We only need to add an additional parameter, or designate
additional bits within the info property, to support the computation precision (labeled in Figure 2 as precAB).
Thus, it is possible to add mixed-datatype support to gemm without any modification to the function interface to
bli gemm().
9 Generally speaking, we consider these separate strides to support three storage formats—column-major, row-major, and so-called
generalized storage (where neither the row stride nor column stride is unit).
6
typedef struct obj_s
{
dim_t offset_m, offset_n;
dim_t dim_m, dim_n;
inc_t rstride, cstride;
doff_t diag_off;
siz_t elem_size;
objbits_t info;
char* buffer;
// Other fields as necessary...
} obj_t;
Figure 4: A simplified definition for the struct underlying the obj t type used within BLIS.
A more thorough walkthrough of BLIS’s object API is well beyond the scope of this article.10 The main takeaway
from this discussion is that the original author of BLIS designed the framework around an object-based core with
the keen understanding that additional APIs of arbitrary format, including (but not limited to) those in the style
of the BLAS, could always be layered on top of this more general abstraction. Consequently, any such APIs built
above and beyond the underlying object layer are only incidental to the framework; they merely constitute syntactic
re-expressions of some subset of the functionality made possible by the object foundation.11
4 Supporting Mixed Domain Computation
We consider the storage domain (real or complex) of the matrix to be orthogonal to the storage precision (single,
double, etc). In this section, we consider how to handle mixing matrix operands of different domains. For now,
the reader should assume that the storage precision is held constant across all matrix operands, and therefore can
be ignored. We also ignore scalars α and β for the time being, which simplifies the general matrix multiplication
operation to C := AB + C.
4.1 The 1m method
The author of [26] recently presented a novel method of computing complex matrix multiplication without relying
upon kernels that explicitly perform complex arithmetic at the scalar level, as is typically the case in high-performance
BLAS libraries. Instead, the so-called 1m method relies only upon matrix primitives (kernels) that compute real
matrix multiplication. And unlike the older and more easily understood 4m method [27], 1m replaces each complex
matrix multiplication with only a single real matrix multiplication.
The key to 1m is a pair of special packing formats, which Van Zee denotes 1e and 1r. The author illustrates
the role of these two packing formats using the following example of complex matrix multiplication C += AB where
m = 3, n = 4, and k = 2.
γr00 γ
r
01 γ
r
02 γ
r
03
γi00 γ
i
01 γ
i
02 γ
i
03
γr10 γ
r
11 γ
r
12 γ
r
13
γi10 γ
i
11 γ
i
12 γ
i
13
γr20 γ
r
21 γ
r
22 γ
r
23
γi20 γ
i
21 γ
i
22 γ
i
23

+=

αr00 −αi00 αr01 −αi01
αi00 α
r
00 α
i
01 α
r
01
αr10 −αi10 αr11 −αi11
αi10 α
r
10 α
i
11 α
r
11
αr20 −αi20 αr21 −αi21
αi20 α
r
20 α
i
21 α
r
21


βr00 β
r
01 β
r
02 β
r
03
βi00 β
i
01 β
i
02 β
i
03
βr10 β
r
11 β
r
12 β
r
13
βi10 β
i
11 β
i
12 β
i
13
 (1)
10 Curious readers may find tutorial-like example codes included alongside the BLIS source code, which is primarily distributed via
GitHub [4]. Markdown documentation is also made available, and may be conveniently rendered via GitHub with modern web browsers.
11 The BLAS API provided by BLIS serves as a classic example of this kind of layering, as it builds on the object API to arrive an an
interface that exactly mimics the BLAS, even if doing so precludes access to functionality and features that would otherwise be available.
7
In this example, it is assumed that matrix C is column-stored, which prescribes that the 1e format be applied to A
and the 1r format be applied to B. This can be confirmed by inspection: applying a real matrix multiplication to
the left- and right-hand matrix product operands would not correctly compute the complex matrix multiplication
if C were row-stored because the intermediate elements of AB would update the wrong elements of C. However,
1m provides a cure to this situation. Namely, symmetry in the method allows for a row-oriented variant where C is
row-stored.
γr00 γ
i
00 γ
r
01 γ
i
01 γ
r
02 γ
i
02
γr10 γ
i
10 γ
r
11 γ
i
11 γ
r
12 γ
i
12
γr20 γ
i
20 γ
r
21 γ
i
21 γ
r
22 γ
i
22
γr30 γ
i
30 γ
r
31 γ
i
31 γ
r
32 γ
i
32
+=

αr00 α
i
00 α
r
01 α
i
01
αr10 α
i
10 α
r
11 α
i
11
αr20 α
i
20 α
r
21 α
i
21
αr30 α
i
30 α
r
31 α
i
31


βr00 β
i
00 β
r
01 β
i
01 β
r
02 β
i
02
−βi00 βr00 −βi01 βr01 −βi02 βr02
βr10 β
i
10 β
r
11 β
i
11 β
r
12 β
i
12
−βi10 βr10 −βi11 βr11 −βi12 βr12

(2)
In this case, the application of the packing formats is reversed, such that 1e is applied to B and 1r is applied to A.
Van Zee later points out that it is not actually the storage of C that determines whether Eq. 1 or 2 is employed, but
rather the SIMD register orientation of the underlying real domain microkernel—which determines the input/output
preference of the microtile and thus the natural method of performing SIMD reads and write instructions on C.
While the 1m method was originally articulated as a way to implement complex domain matrix multiplication
on operands of identical datatypes, we will soon show that not only can it be extended to mixed-precision complex
domain computation, but it also indirectly supports a particular combination of mixed-domain operands.
4.2 Enumerating the cases
Consider for simplicity C := AB+C, ignoring the scaling factors α and β. Each of {A,B,C} can be stored in either
the real or complex domains, leading to 23 = 8 different combinations. Figure 5 enumerates and names each possible
case. We will now discuss each case, how it is interpreted, and how it is implemented within the BLIS framework.
4.2.1 Cases 0, 3
The trivial case where all matrices are stored in the real domain, which we refer to as Case 0, is already supported by
the framework via algorithms based on real domain microkernels. Similarly, Case 3, which applies when all matrices
are stored in the complex domain, is also already supported. Support for Case 3 is provided in BLIS via conventional
algorithms based on complex domain microkernels as well as via the 1m method, which is particularly useful when
complex microkernels are not available. Cases 0 and 3 incur 2mnk and 8mnk flops, respectively.
4.2.2 Cases 1a, 1b
Case 1a captures situations where C and B are real while A is complex. We interpret such an operation as C :=
Re(A)B + C. Implementing this case in BLIS is rather straightforward: we ignore the imaginary part of A and
compute as if all matrices were real. Because BLIS tracks both row and column strides for each matrix operand,
ignoring the imaginary elements amounts to a temporary change to the dimension and stride metadata contained
within the object representing A. Case 1b involves a complex matrix B and real matrices C and A, but is otherwise
handled similarly. Since these case are ultimately implemented in terms of Case 0, they both performs 2mnk flops.
4.2.3 Case 2ab
Case 2ab is applicable when A and B are complex while the matrix to which they accumulate, C, is real. We
interpret this somewhat curious scenario as a matrix product that takes place in the complex domain, but one for
which the imaginary result is discarded: C := Re(AB) + C. Since Im(AB) is not needed, only 4mnk flops need
to be performed. Thus, this case provides an opportunity for computational savings when properly implemented.
BLIS implements 2ab by borrowing the 1r packing format used by the 1m method.12 Specifically, BLIS packs
both matrices A and B using the 1r format while simultaneously conjugating B. This has the effect of allowing a
12 This is the indirect support alluded to at the conclusion of Section 4.1.
8
Case C A B Description/Approach
0 R R R Supported by the original framework. Performs 2mnk flops.
1b R R C Interpreted as C := ARe(B) + C. Ignore Im(B) and compute as
Case 0. Performs 2mnk flops.
1a R C R Interpreted as C := Re(A)B + C. Ignore Im(A) and compute as
Case 0. Performs 2mnk flops.
2ab R C C Interpreted as C := Re(AB)+C. Use 1r packing format from 1m
method to compute only Re(AB). Performs 4mnk flops.
1c C R R Compute AB and accumulate into Re(C). Performs 2mnk flops.
2bc C R C
Compute as if A ∈ C, but avoid all computations with Im(A).
Represent C and B with real and imaginary elements indistin-
guishable within a m× 2n and k × 2n real matrices, respectively.
Requires a row-preferential microkernel. Performs 4mnk flops.
2ac C C R
Compute as if B ∈ C, but avoid all computations with Im(B).
Represent C and A with real and imaginary elements indistin-
guishable within 2m × n and 2m × k real matrices, respectively.
Requires a column-preferential microkernel. Performs 4mnk flops.
3 C C C
Supported by the original framework (via the the 1m method
and/or assembly-coded complex microkernels). Performs 8mnk
flops.
Figure 5: A table summarizing the eight possible cases of mixed-domain computation within the gemm operation.
The first column identifies a name for each case, with the number identifying the number of complex matrix operands
and the letters identifying the matrices that are complex. The second, third, and fourth columns explicitly identify
the domains of each matrix operand. The last column describes the interpretation of each case within BLIS along a
brief comment on how the case is implemented (where applicable) and a (minimum) flop count when implemented
optimally.
subsequent real matrix multiplication over the packed matrices to correctly compute only the real half of the complex
matrix multiplication update.
4.2.4 Case 1c
The opposite of 2ab—Case 1c—refers to settings in which matrix C is complex while A and B are real. Since the
matrix product takes place entirely in the real domain, the natural interpretation is that AB updates only Re(C),
and Im(C) is left untouched. Generally speaking, BLIS implements 1c using a strategy similar to the one used with
1a and 1b. That is, a temporary change to the object metadata describing matrix C allows us to isolate Re(C),
which once again reduces the problem to Case 0. Accordingly, this case requires only 2mnk flops.
4.2.5 Cases 2ac, 2bc
Consider Case 2ac, in which matrices C and A are complex and matrix B is real. We interpret this situation as
performing a complex matrix product AB to update both real and imaginary parts of C. However, all computation
involving the imaginary part of B, which is implicitly zero, may be ignored. This means that the computation
requires only 4mnk flops. Now, if C and A were guaranteed to be column-stored, BLIS could handle this case with a
simple change of metadata that recasts those complex matrices as real, with the real and imaginary elements treated
equally and indistinguishably. However, BLIS also allows row storage (and general storage) for all matrices, and thus
the solution is not quite so simple. Instead, BLIS handles 2ac as follows.
9
If the original problem fits into Case 2ac and the microkernel is column-preferential, A is packed as a real matrix
(with imaginary elements stored traditionally, in element-wise interleaved fashion) and B is packed normally. A
real domain macrokernel (Case 0) is then executed, which will properly update C. However, if the microkernel is
row-preferential, the operation is logically transposed and Case 2bc is executed instead (whereby matrices C and B
are complex and A is real). Thus, the effective case employed, 2ac or 2bc, depends on the register orientation of
the microkernel, not the storage of C, and does so for the same reason that the 1m method, discussed previously in
Section 4.1, depends on the same property.
After the aforementioned logical transposition is applied (or not), the microkernel input/output preference may
differ from the storage of matrix C. If this is the case, then BLIS calls a virtual microkernel13, instead of calling
the microkernel directly, allowing for logic that will use a very small amount of temporary storage—equivalent to
one microtile—in order to facilitate the proper use of the microkernel (whether it be row- or column-preferential)
and then copy and/or accumulate the temporary microtile result back to the appropriate location within the output
matrix C.
4.3 Computation domain
Unlike the computation precision, which is discussed in the next section, the computation domain is implied according
to the case-specific interpretations covered in Section 4.2 and summarized in Figure 5. Alternate interpretations exist,
however. For example, the computation of Case 2ab could be interpreted as taking place in the real domain, which
would result in Im(A) and Im(B) being ignored. Similar interpretations—all of which would change the update
to C—could be applied to Cases 2ac, 2bc, or even 3. However, we do not immediately see significant utility in
exposing these cases.14 Thus, our implementation in BLIS does not presently allow the caller to explicitly specify
the computation domain.
5 Supporting Mixed Precision Computation
Now that variation among the storage domains has been fully explored, we turn our attention to the storage precision
of the matrices A, B, and C. Once again, we set aside the scalars α and β, focusing only on variation among matrix
operands. We also limit the initial discussion to variation within the set of precisions that includes only single and
double precision.
5.1 Supporting all cases
Each of {A,B,C} can be stored in single or double precision. Furthermore, we define a separate computation precision
to identify the precision in which the matrix product takes place. Combining the storage precision of each matrix
with the computation precision, we find that there exist 24 = 16 different cases.
At first glance, it may seem worthwhile to enumerate all mixed-precision cases as we did for mixed-domain
computation in Section 4. However, there is a more concise and systematic way of describing how to support all
cases, one that happens to coincide closely with how mixed-precision support was ultimately implemented in BLIS.
While not part of the runtime logic for implementing mixed-precision computation, we must first modify the
packing facility so that the source and destination precisions may differ. For example, we must be able to pack from
a single-precision real matrix to a double-precision real matrix, or vice versa. Once the mixed-precision functionality
is in place for the packing operation, the runtime logic may be encoded in three broad steps, as follows.
13 In BLIS, virtual microkernels share the same type signature of conventional (“native”) microkernels and ultimately compute the
same operation. The only difference is that virtual microkernels typically implement the microkernel operation in the form of additional
logic before (and sometimes after) the call to the native microkernel. Sometimes, as with the 1m method, the native microkernel being
called is the real-domain equivalent relative to the virtual microkernel’s type signature (e.g., a virtual zgemm() microkernel implemented
in terms of the native dgemm() microkernel). Thus, the term is somewhat general-purpose and additional context is needed to identify
its specific nature.
14 If an intrepid user wished to access such functionality, it could easily be done currently with BLIS by manipulating the matrix object
metadata accordingly. Of course, if users show interest in this functionality, we will reconsider official support within the framework.
10
5.1.1 Identify the computation precision
First, we must identify the computation precision. In BLIS, we provide a field for the computation precision within
the metadata of any matrix object. However, semantically, we deem only the field within the object for matrix
C to be relevant. Thus, upon calling the gemm operation, we query the computation precision from C. Note
that in BLIS, when objects are created, the computation precision is initialized to be equal to the object’s storage
precision. Consequently, if it is not explicitly set by the caller prior to invoking gemm, the computation precision
will automatically default to the storage precision of C.
5.1.2 Construct the target datatypes for A and B
Next, we embed target datatypes within the metadata for objects representing matrices A and B. BLIS defines the
target datatype for A and B as being the storage datatype (domain and precision) of the matrix during its packed
state—in other words, the datatype to which the matrix must be typecast before it can be computed upon.15 The
target datatype for matrix A is constructed by combining the storage domain of A with the operation’s computation
precision, with a similar process resulting in the target datatype for matrix B. The target datatype for each matrix
is then embedded within the metadata of the corresponding object.
5.1.3 Determine whether typecasting is needed on accumulation
If the computation precision differs from the storage precision of C, then the intermediate result from computing the
product AB must be typecast before it is accumulated into C. This may be implemented outside the microkernel
by implementing additional macrokernels that write the microkernel result to a temporary microtile (allocated on
the function stack) before typecasting and accumulating the temporary values back to C. Alternatively, this logic
may hidden within a virtual microkernel. BLIS opts for the former solution, which somewhat reduces function call
overhead at the cost of a somewhat higher object (binary) code footprint.
If the computation precision is identical to the storage precision of C, then AB does not require any typecasting
before being accumulated. This corresponds to use of the traditional macrokernel.
5.2 Using the 1m method for Case 3
As alluded to in the closing of Section 4.1, the 1m method can be extended to encompass all combinations of
mixed-precision operands—that is, all precision combinations that fall within mixed-domain Case 3. This amounts
primarily to (1) adding the ability to pack to the 1e and 1r16 formats when the target datatype differs from the
storage datatype and (2) adding the ability to scale by α (when αi 6= 0) during mixed-precision packing of A and
B. With these changes in place, the 1m virtual microkernel may be used as-is since its functioning is undisturbed
by the typecasting logic encoded in the additional macrokernels mentioned previously in Section 5.1.3.17
5.3 Summary
By addressing the three areas described in Sections 5.1.1 through 5.1.3, and adding support for typecasting within
the packing function, we can handle all 16 cases of mixing single and double precisions. Similarly, relatively minor
changes to BLIS’s implementation of the 1m method enable all mixed-precision instances of Case 3 to be handled
even if conventional, assembly-coded complex microkernels are unavailable.
Interestingly, the hypothetical impact of adding support for an additional floating-point precision, such as half-
precision, would manifest almost exclusively in the form of additional support to the packing function.18 The
mixed-precision runtime logic, described above, would then trivially extend to support the two additional datatypes
(one each for real and complex domains).
15 Note that we do not need to track the target datatype for C since the storage datatype of C does not change in the course of the
mixed-datatype gemm operation.
16 The ability to typecast while packing to the 1r format is also required by mixed-precision instances of Case 2ab.
17 In the course of our work, the BLIS testsuite was updated to allow testing of its 1m method implementation with mixed-precision
operands.
18 A real domain microkernel that performs computations in the new precision would also be needed.
11
6 Optimizations
After retrofitting the mixed-domain/mixed-precision functionality into the gemm implementation, it is possible to
apply various optimizations to certain cases. In this section, we briefly explore some of these optimizations.
6.1 Avoid virtual microkernel overhead with two microkernels
If and when BLIS adopts a regime whereby each hardware architecture is supported simultaneously by two micro-
kernels per datatype, one with a row preference and one with a column preference, then the virtual microkernel logic
becomes unnecessary, and both 2ac and 2bc may be fully implemented without additional overhead.
6.2 Avoid non-contiguous/non-SIMD access on C
Some mixed-domain cases—at least as described in Section 4.2—result in accessing complex matrix C by individual
real and imaginary elements. The best example of this is Case 1c, in which Re(C) is updated by the product of
real matrices A and B. However, in order to isolate Re(C), the metadata describing matrix C must be tweaked in
such a way that the matrix then has non-unit stride in both dimensions. While BLIS microkernels may update such
matrices, doing so on current hardware comes with an unavoidable performance penalty that does not manifest when
accessing contiguous elements via SIMD load and store instructions. Thus, it may be advantageous to internally
allocate a temporary m × n matrix Ctemp in which to compute AB, after which the result is copied back to C. As
long as Ctemp is created (a) in the real domain and (b) as either row- or column-stored, the microkernel will be able
to update Ctemp efficiently with SIMD instructions.
This optimization tends to be most worthwhile when k > kC , as it would imply that the computation of AB
unfolds as multiple rank-kC updates of C (that is, multiple iterations of the 4th loop around the microkernel), with
the non-contiguous load/store penalty otherwise being incurred for each update.
6.3 Reduce typecasting costs and round-off error accumulation
When the storage precision of C differs from the computation precision, the gemm implementation executes typecast-
ing instructions emitted by the compiler in order to properly convert from one floating-point datatype to another—
single- to double-precision or double- to single-precision. These instructions can be costly, especially when k > kC ,
since each element of C is typecast once per rank-kC update. In addition to the performance cost of these typecasting
instructions, they can also incur a numerical cost. Specifically, when the storage precision of C is lower than the
computation precision, repeated round-off error can occur during accumulation of the intermediate matrix product,
which is once again exacerbated for large values of k. But as with the repeated cost incurred from non-contiguous
access described in Section 6.2, these costs can be avoided by allocating a temporary m× n matrix Ctemp. The key
difference is that, in this case, Ctemp is created with its storage precision equal to the computation precision, which
will avoid intermediate typecasting (and thus reduce round-off error), leaving only typecasts on input (Ctemp := C)
and on output (C := Ctemp). The total number of typecasts needed may be further reduced to those on output
provided that Ctemp is initialized to zero and the final AB product be accumulated, rather than copied, back to C.
6.4 Avoid virtual microkernel overhead with Ctemp
Notice that Cases 2ac and 2bc, as described in Section 4.2, may require use of a virtual microkernel if a row-
preferential microkernel (needed by 2bc) must be used on a column-stored (or general stored) matrix, or if a
column-preferential microkernel (needed by 2ac) must be used on a row-stored (or general stored) matrix. The
overhead of the virtual microkernel, while small, may still be noticeable and is incurred for each rank-kC update.
The dual-microkernel strategy described in Section 6.1 solves this issue. However, if only a single microkernel (per
datatype) is available, a temporary m×n matrix Ctemp may be allocated, with the important distinction being that
Ctemp is created with storage (by columns or rows) to match the preference of the available microkernel. This allows
the implementation to avoid the virtual microkernel altogether during intermediate accumulation into Ctemp.
12
6.5 Summary
The optimizations described above are optional. At the time of this writing, BLIS implements all except the dual-
microkernel strategy described in Section 6.1. BLIS also allows the the user to optionally disable all uses of Ctemp
at configure-time, which avoids the extra workspace that would otherwise be needed by the optimizations discussed
in Sections 6.2 through 6.4.
7 Handling scalars
Before concluding our discussion of how to implement and support mixed-datatype gemm, we turn our attention to
scalars α and β, which have been omitted from our discussion thus far.
7.1 Mixed precision
If the precision of α differs from the computation precision, a decision must be made as to how to proceed. Numerous
possible policies exist for handling such situations. Three examples follow:
1. Typecast α to match the computation precision.
2. Typecast the computation precision to match that of α.
3. Unconditionally promote the lower precision value to the precision of the other (higher precision) value.
Our mixed-datatype extension to BLIS currently opts for option (1).
A similar decision must be made for handling the precision of β. The choices here are even more numerous, because
while we consider the precision of β and the storage precision of C to be the main inputs to the runtime logic, one
could also argue for considering the disagreement with the computation precision that governs the computation of
AB. A few possible policies are:
1. Typecast β to match the storage precision of C.
2. Perform the scaling βC in the higher precision of the two values, typecasting back to the storage precision of
C, as necessary.
3. Typecast both β and C to the computation precision so that all suboperations within βC + AB can occur in
the same precision before being typecast back to the storage precision of C.
Once again, our solution in BLIS opts for the relatively simple solution alluded to in (1), with αAB being typecast
to the storage precision on accumulation to C.
7.2 Mixed domain
Real values of β are easily handled for all eight cases enumerated in Section 4.2.
For Cases 0, 1a, 1b, and 1c, complex β may be projected into the real domain (which discards Im(β) entirely)
since, with C ∈ R, Im(β) cannot change the final result. Similarly, complex values of β are handled as expected in
the four cases where C ∈ C. Specifically, Case 3 already handles complex β while Cases 2ab, 2ac, and 2bc support
Im(β) 6= 0 via extra logic in the virtual microkernel.
Real and complex values of α are already handled in Cases 0 and 3, respectively. Real α are also already handled
by Case 3 since R ⊂ C. In Case 0, a complex α can be projected to the real domain since Im(α) would not change
the computation, even if we had storage in which to save the final result.
For Cases 1a, 1b, 1c, 2ab , 2ac, and 2bc, a non-zero imaginary component in α could presumably change the
final computation under a literal interpretation of the computation—that is, one in which all five operands’ domains
are taken at face value. However, implementing this logic is non-trival. For example, consider our approach to
handling Case 1a, as discussed in Section 4.2. In this case, we perform the computation according to Case 0, as if
the imaginary components of A were zero. That case’s handling is completely consistent with its mathematics, since
in that scenario A is the only operand with non-zero imaginary values, and thus they would have no impact on the
13
final result. However, if both A and α are complex, then the imaginary components could combine to change the
real component of the scalar-matrix product αA. Adjusting for this new possible use case would require a different
approach in the implementation, perhaps using temporary workspace to store a copy of A while it is scaled by
α, after which the imaginary components may be ignored (assuming they were even computed to begin with).19
However, while it is clear that going through such motions would maintain deeper fidelity to the literal mathematics
expressed in the mixed-domain scenario, it’s not clear to us that this additional functionality would be vital for most
applications. As we continue to solicit feedback from the community, we will pay close attention to whether users
expect or request support for non-zero imaginary values of α in Cases 1a, 1b, 1c, 2ab , 2ac, and 2bc. For now, our
mixed-datatype solution supports only real values of α for those six cases, and prints an error message if the scalar
is given with a non-zero imaginary component.
8 Performance
In this section, we discuss performance results for our mixed-datatype implementations on two servers with modern
hardware architectures.
8.1 Platform and implementation details
8.1.1 Marvell ThunderX2
The first system upon which we measured performance is a single compute node consisting of two 28-core Marvell
ThunderX2 CN9975 processors.20 Each core, running at a clock rate of 2.2 GHz, provides a single-core peak per-
formance of 17.6 gigaflops (GFLOPS) in double precision and 35.2 GFLOPS in single precision. Each of the two
sockets has a 32MB L3 cache that is shared among its local cores, and each core has a private 256KB L2 cache and
32KB L1 (data) cache. The installed operating system was Ubuntu 16.04 running the Linux 4.15.0 kernel. Source
code was compiled by the GNU C compiler (gcc) version 7.3.0.21
8.1.2 Intel Xeon Platinum
The second system is a single node consisting of two 26-core Intel Xeon Platinum 8167M processors.22 Each core
ran at a clock rate of 2.0 GHz, providing single-core peak performance of 64 gigaflops (GFLOPS) in double precision
and 128 GFLOPS in single precision. Each of the two sockets has a 35.75MB L3 cache that is shared among its local
cores, and each core has a private 1MB L2 cache and 32KB L1 (data) cache. The installed operating system was
Ubuntu 18.04.1 running the Linux 4.15.0 kernel. Source code was compiled by the GNU C compiler (gcc) version
7.3.0.23
8.1.3 Implementations
On both the ThunderX2 and the Xeon Platinum, the version of BLIS used was based on an inter-version release
that preceded 0.5.1.24 In both cases, BLIS was configured with OpenMP-based multithreading. Architecture-specific
configuration, which determines settings such as kernel sets and cache blocksizes, was performed automatically via
the auto target to the configure script.
We showcase two (or, in some cases, three) implementations, with a third (or fourth) provided for reference:
19 Alternatively (and more preferably), logic that packs Re(αA) where α,A ∈ C may be encoded within a special packing function.
20 While four-way symmetric multithreading (SMT) was available on this hardware, the feature was disabled at boot-time so that the
operating system detects only one logical core per physical core and schedules threads accordingly.
21 The following optimization flags were used during compilation on the ThunderX2: -O3 -ftree-vectorize -mtune=cortex-a57. In
addition to those flags, the following flags were also used when compiling assembly kernels: -march=armv8-a+fp+simd -mcpu=cortex-a57.
22 Two-way SMT (which Intel refers to as “Hyperthreading”) was available. However, we employed processor affinity settings that
limited the operating system to utilizing only one logical core per physical core.
23 The following optimization flags were used during compilation on the Xeon Platinum: -O3. In addition to those flags,
the following flags were also used when compiling assembly kernels: -mavx512f -mavx512dq -mavx512bw -mavx512vl -mfpmath=sse
-march=skylake-avx512.
24 This version of BLIS may be uniquely identified, with high probability, by the first 10 digits of its git “commit” (SHA1 hash)
number: cbdb0566bf.
14
• Internal without extra memory. This refers to the BLIS implementation described in Sections 4 and 5
in which all logic for supporting the mixing of domains and precisions occurs opaquely inside of BLIS. This
implementation does not, however, employ the use of a temporary matrix Ctemp discussed in Sections 6.2–6.4.
• Internal with extra memory. This implementation is identical to the previous implementation, except that
Ctemp is made available, thus incurring extra workspace requirements in certain situations. It is worth pointing
out that this implementation does not differ from its extra-memory-avoiding counterpart for all 128 mixed-
datatype cases. Thus, we will only present these results when one of the conditions laid out in Sections 6.2–6.4
is applicable. Also, note that we do not employ any cache blocking or parallelism outside of the underlying call
to gemm, such as when copying/accumulating Ctemp to/from C.
• Ad-hoc. This refers to an implementation that is formulated outside of BLIS, using temporary workspace and
matrix copies wherever needed. The purpose behind such an implementation is to show the best a knowledgeable
computational scientist could reasonably expect to achieve using only a BLAS library. Thus, while we link this
solution to BLIS, we do so via the framework’s BLAS compatibility layer. Furthermore, we do not employ any
cache blocking or parallelism outside of the underlying call to gemm.
• Reference. This refers to the sgemm(), dgemm(), cgemm(), and zgemm() provided by BLIS, whichever is
appropriate, as determined by the mixed-datatype case presented by each graph. These Reference curves are
provided as a visual “high-water mark” to show how much performance is ceded by the Internal and Ad-hoc
mixed-datatype implementations. As of this writing, we had not yet developed native complex domain gemm
microkernels, and therefore the cgemm() and zgemm() curves represent BLIS implementations based on the 1m
method.
8.2 Results
8.2.1 Scope and Conventions
Performance data for sequential, multithreaded within one socket (28 threads), and multithreaded across two sockets
(56 threads) was gathered on the ThunderX2. Similarly, sequential, single-socket (26 threads), and dual-socket (52
threads) data was gathered on the Xeon Platinum. This produced a rather large set of data, which we formatted into
768 individual graphs. As a practical matter, we have relegated this complete set of performance results to Online
Appendix A. Here, in the main body of the article, we limit our presentation to a slice of data that we feel is broadly
representative of the full set.
For any given graph, the x-axis denotes the problem size (where m = n = k), the y-axis shows observed floating-
point performance in units of GFLOPS per core, and the theoretical peak performance of the hardware coincides
with the top of the graph. Problem sizes for sequential instances of gemm were run from 40 to 2000 in increments
of 40 while multithreaded executions were run from 120 to 6000 in increments of 120.25 The data points in all
performance graphs report the best of three trials.
Individual graphs are labeled according to the mixed-datatype case of its Internal and Ad-hoc implementations.
The datatypes are encoded as cabx, where the characters c, a, and b encode the storage datatypes of C, A, and B,
respectively, while x encodes the computation precision. For example, a case labeled “zcsd” would refer to mixed-
domain Case 2ac, where matrices A and B are stored in single-precision (complex and real, respectively), matrix C
is double-precision complex, and the computation occurs in double-precision arithmetic.
All experiments reflect the use of randomized, column-stored matrices with gemm scalars α = 1 and β = 1.
8.2.2 Exposition
Figure 6 (top) reports sequential performance on the Marvell ThunderX2. The six graphs on the left half of Figure 6
(top) report performance for six select mixed-datatype cases that we felt are interesting: sdds, ddds, dsss, ccss,
cscs, and csss. All of these mixed-datatype cases perform their computation in single-precision. On the right half,
25 Some readers may wish that we had run our multithreaded experiments to a somewhat larger maximum problem sizes, perhaps
8,000 or 10,000. We sympathize with these readers. However, the results presented in this article, including Online Appendix A, required
a total of 302,400 invocations of gemm performed over a period of several days. Limiting the maximum problem size was necessary so
that the experiments would finish in a reasonable amount of time.
15
800 1600
0
10
20
30
G
FL
O
PS
sdds
800 1600
0
10
20
30
ccss
Ref (sgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
800 1600
0
5
10
15
dssd
800 1600
0
5
10
15
zzdd
Ref (dgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
800 1600
0
10
20
30
G
FL
O
PS
ddds
800 1600
0
10
20
30
cscs
800 1600
0
5
10
15
sssd
800 1600
0
5
10
15
zdzd
800 1600
     m = n = k
0
10
20
30
G
FL
O
PS
dsss
800 1600
     m = n = k
0
10
20
30
csss
800 1600
     m = n = k
0
5
10
15
sddd
800 1600
     m = n = k
0
5
10
15
zddd
2500 5000
0
10
20
30
G
FL
O
PS
/c
or
e
sdds
2500 5000
0
10
20
30
ccss
Ref (sgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
5
10
15
dssd
2500 5000
0
5
10
15
zzdd
Ref (dgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
10
20
30
G
FL
O
PS
/c
or
e
ddds
2500 5000
0
10
20
30
cscs
2500 5000
0
5
10
15
sssd
2500 5000
0
5
10
15
zdzd
2500 5000
     m = n = k
0
10
20
30
G
FL
O
PS
/c
or
e
dsss
2500 5000
     m = n = k
0
10
20
30
csss
2500 5000
     m = n = k
0
5
10
15
sddd
2500 5000
     m = n = k
0
5
10
15
zddd
Figure 6: Sequential (top) and multithreaded with 28 threads (bottom) performance of “Internal” and “Ad-hoc”
implementations of gemm for select datatype combinations on a Marvell ThunderX2 CN9975 processor. The
12 graphs on the left side and right sides report computation in single- and double-precision, respectively. The
theoretical peak performance coincides with the top of each graph.
16
800 1600
0
50
100
G
FL
O
PS
sdds
800 1600
0
50
100
ccss
Ref (sgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
800 1600
0
20
40
60
dssd
800 1600
0
20
40
60
zzdd
Ref (dgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
800 1600
0
50
100
G
FL
O
PS
ddds
800 1600
0
50
100
cscs
800 1600
0
20
40
60
sssd
800 1600
0
20
40
60
zdzd
800 1600
     m = n = k
0
50
100
G
FL
O
PS
dsss
800 1600
     m = n = k
0
50
100
csss
800 1600
     m = n = k
0
20
40
60
sddd
800 1600
     m = n = k
0
20
40
60
zddd
2500 5000
0
50
100
G
FL
O
PS
/c
or
e
sdds
2500 5000
0
50
100
ccss
Ref (sgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
20
40
60
dssd
2500 5000
0
20
40
60
zzdd
Ref (dgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
50
100
G
FL
O
PS
/c
or
e
ddds
2500 5000
0
50
100
cscs
2500 5000
0
20
40
60
sssd
2500 5000
0
20
40
60
zdzd
2500 5000
     m = n = k
0
50
100
G
FL
O
PS
/c
or
e
dsss
2500 5000
     m = n = k
0
50
100
csss
2500 5000
     m = n = k
0
20
40
60
sddd
2500 5000
     m = n = k
0
20
40
60
zddd
Figure 7: Sequential (top) and multithreaded with 26 threads (bottom) performance of “Internal” and “Ad-hoc”
implementations of gemm for select datatype combinations on a Intel Xeon Platinum 8167M processor. The
12 graphs on the left side and right sides report computation in single- and double-precision, respectively. The
theoretical peak performance coincides with the top of each graph.
17
we display graphs that correspond to the precision-toggled analogues of the graphs on the left—dssd, sssd, sddd,
zzdd, zdzd, and zddd—all of which perform their computation in double-precision. Both Internal implementations
(with and without extra memory) are shown in four of the six graphs in each group, with the remaining two graphs
displaying performance only for the implementation where extra memory is disabled, since the optimization is not
applicable for those cases.
The legends are shown once for each group of six graphs. Within the legend, the curves labeled “Intern (+xm)”
and “Intern (-xm)” refer to the Internal implementations with and without extra memory, respectively. Additionally,
the label for the Reference implementation is augmented, in parentheses, with the conventional gemm routine that
serves as the reference curve within that group of six graphs.
Organized identically as those in the top, the graphs in Figure 6 (bottom) report multithreaded performance on
one socket (28 threads) of the ThunderX2.
Finally, Figure 7 (top) and (bottom) report sequential and single-socket (26 threads) performance, respectively,
on the Intel Xeon Platinum using the same mixed-datatype cases and organization as shown in Figure 6.
8.2.3 Observations and Analysis
Let us turn first to the sequential performance results from the ThunderX2.
The first thing we notice is that, in five of the six mixed-datatype cases, the Ad-hoc approach is capable of
performing quite well relative to the Internal implementations. The sixth case, which falls within mixed-domain
Case 2bc, suffers because, unlike with 2ac, we are unable to cast the problem in terms of sgemm() or dgemm() by
manipulating matrix metadata, and therefore must resort to using cgemm() and zgemm(). And because Ai = 0, this
approach necessarily wastes half of the floating-point computations.
Next, we notice that employing Ctemp often affords a modest but noticeable increase in performance relative to
forgoing extra memory. This is expected for all of the reasons described in Sections 6.2–6.4.
Turning to the multithreaded performance on ThunderX2, we notice that the effect of using extra memory in
the Internal implementation is not only reversed, but also magnified. Here, the additional costs incurred within the
virtual microkernel are overwhelmed by the cost of the accumulation of Ctemp back to C that must be performed
after the gemm operation, which is not parallelized.26 The performance of the Ad-hoc implementation also suffers,
for similar reasons to that of the Internal implementation using Ctemp. The effect is even worse for Ad-hoc, however,
because that implementation must make whole copies of matrices up-front, and does so sequentially, before executing
the underlying gemm operation. By contrast, the Internal implementations benefit from typecasting A and B during
packing, which is already parallelized.
Overall, the extra-memory-avoiding Internal implementation performs quite well relative to its sgemm() and
dgemm() benchmarks.
Turning to the Intel Xeon Platinum results in Figure 7, we find the data largely tells the same story. Here, the
multithreaded performance degradation caused by employing extra memory is even more severe, and the Ad-hoc
performance is similarly attenuated. Once again, in both sequential and multithreaded cases, one of the Internal
implementations matches or exceeds (sometimes by a large margin) that of the Ad-hoc approach. However, for some
datatype cases, even the memory-avoiding Internal implementation lags noticeably behind its sgemm() or dgemm()
benchmark. The cause of this is not immediately clear, but may be related to memory bandwidth becoming strained
in cases where the the virtual microkernel is relied upon to update the output matrix in two steps, using a temporary
microtile as intermediate storage.
These results strongly suggest that, in general, BLIS should employ the use of Internal implementation with
Ctemp for sequential invocations of gemm, but avoid Ctemp in the case of many-threaded execution. As a function
of the number of threads, the crossover point between the two Internal implementations will likely depend on the
amount of parallelism that can be extracted within the accumulation of Ctemp back to C before memory bandwidth
is saturated. We leave this topic for future exploration.
26 This copy/accumulation operation, while not parallelized for any of the implementations tested for this article, could in principle
be parallelized. However, speedup for that component of the gemm would likely be limited as the many threads quickly saturate the
available memory bandwidth.
18
9 Measuring the impact on code size
Prior to reading the article, a casual reader might have been skeptical of the practicality of our solution. However,
Sections 4 and 5 decompose the problem into mostly orthogonal use cases, giving hope that the ultimate impact on
library code size is much more manageable.
Indeed, by multiple measures, the BLIS library grew only modestly after introducing mixed-datatype support.
Framework
source code
Total lines Change
Total size
(Kilobytes)
Change
Before 148,862 4,706
After 154,962 +6,100 4,892 +186
Figure 8: The total number of lines and the total size (in kilobytes) of source code in the BLIS framework (excluding
the build system, kernels, and the testsuite) before (git commit 667d3929) and after (git commit 5fec95b9) support
was added for mixed-datatype computation via the gemm operation.
Testsuite
source code
Total lines Change
Total size
(Kilobytes)
Change
Before 22,891 680
After 24,356 +1,465 722 +42
Figure 9: The total number of lines and the total size (in kilobytes) of source code in the BLIS testsuite before
(667d3929) and after (5fec95b9) support was added for mixed-datatype computation via the gemm operation.
Object code size
(Kilobytes)
Static
library
Change
Shared
library
Change
Statically-
linked
testsuite
Change
Before 3,141 2,286 1,632
After (disabled) 3,253 +112 2,382 +94 1,720 +88
After (enabled) 3,366 +225 2,486 +200 1,820 +188
Figure 10: The total object code size for three build products: BLIS built as a static library; BLIS built as a shared
library; and the BLIS testsuite linked against the aforementioned static library. Object code sizes are given using the:
BLIS library just prior to mixed-datatype support (667d3929); with mixed-datatype support present (5fec95b9)
but disabled at configure-time; and with mixed-datatype support present (5fec95b9) and enabled at configure-time.
The second column in Figure 8 shows the total number of lines27 of code present in the BLIS framework proper—
which excludes other components such as the build system, kernels, and the testsuite—before and after mixed-
datatype support was added to the gemm operation.28 The fourth column shows the total size in kilobytes of the
source code. The change between the “before” and “after” values for total lines and total size are shown in the third
and fifth columns, respectively. Mixed-datatype support for the gemm operation adds approximately 4% each to the
total number of lines and total bytes of source code.
Figure 9 lists similar metrics for the BLIS testsuite, which is capable of testing the vast majority of BLIS’s
computational operations. Here, the support for testing all combinations of mixed-datatype execution, with any
27 In this section, we uniformly report total lines of code, including blank lines and comments.
28 The “before” and “after” snapshots of BLIS are uniquely identified with high probability by the first eight digits of the git commit
(SHA1 hash) numbers. Commit 667d3929 identifies the code just before mixed-datatype support was added, while 5fec95b9 identifies
the first commit in which mixed-datatype support is present. This latter commit includes virtually all changes discussed in this article
with the exception of the mixed-precision support for the 1m method, which was added in a later commit (375eb30b).
19
combination of matrix storage storage or transposition and/or combinations, increases the source code footprint by
approximately 6% in both lines and total size.
Finally, Figure 10 shows the object (binary) code size for three build products: BLIS built as a static library;
BLIS built as a shared library; and the BLIS testsuite linked against the static library.29 This figure shows three
rows of values: before mixed-datatype support for gemm was added (667d3929); after mixed-datatype support was
added (5fec95b9) where the feature was disabled at configure-time; and after mixed-datatype support was added
(5fec95b9) where the feature was enabled at configure-time. (In the the latter two cases, the “Change” columns
represent the change from the “before” state.) With mixed-datatype support present and enabled, the size of the
static library increases by only 255KB, or 8% of the original library size. In the case of the shared library, the increase
is just under 9%. And a statically-linked instance of the BLIS testsuite increases by about 11%.
Thus, no matter the metric, the increase in code footprint is quite modest relative to the scope of functionality
added.
10 Final Thoughts
We conclude this article by sharing with the reader insights and observations that we have drawn to-date about
BLIS, BLAS, and the adoption of new software functionality by the broader HPC community.
10.1 Case studies
In late 1990’s, various community participants convened multiple meetings of the BLAS Technical (BLAST) Forum
to discuss extensions to the original BLAS [3]. Some of these extensions targeted extended and mixed-precision
functionality and were eventually implemented and branded as XBLAS [18, 29]. In the end, the mixed-precision
extensions were not widely adopted. We speculate that this was in part due to the fact that the reference implemen-
tation falls short of achieving high performance. By contrast, when the reference codes of the original BLAS [8, 2]
were introduced, compilers could easily translate them to binary implementations that would yield high performance
on the vector supercomputers that were prominent at the time. Though computer architectures have evolved since
then, we suspect that the initial availability of high-performance reference BLAS implementations helped give rise
to a network of institutional experts and expertise, laying the foundation for today’s well-established constellation
of BLAS solutions.
A classic example of the release of a new standard hand-in-hand with a high quality implementation was the
Message-Passing Interface (MPI) [12, 23]. From the start, an implementation that later become known as MPICH
provided a high-performance reference implementation [5]. Another example is TBLIS, a library and framework for
performing efficient contractions and related operations on tensors [19]. TBLIS borrows some of the insights of BLIS
and then goes further to construct a general-purpose tensor library from scratch. This new software architecture
avoids the drawbacks of ad-hoc, BLAS-based solutions that must first reorder tensors into column-major storage
simply for the purpose of calling dgemm(). The result is a comprehensive set of tensor functionality that exports
flexible APIs (in C89 and C++11) while also facilitating higher performance for both sequential and multithreaded
applications. In contrast to the BLAST extensions, we feel that MPICH and TBLIS serve as important examples
of software projects that each made an impact in their community by coupling new APIs and functionality with a
complete high-performance reference implementation.
10.2 Managing complexity
Supporting the goal of providing a high-performance reference implementation with new APIs, while a laudable step
in the right direction, is ultimately insufficient if the software is not designed to be practically implementable and
maintainable. Suppose we attempted to solve the problem of mixed-datatype gemm by implementing a solution in
the style of the original reference BLAS. Such an approach tends to lead to implementations that are duplicative
and vertically siloed from one another, with support for each datatype, storage case, and transposition/conjugation
scenario resulting in a fully independent block of code. On top of supporting four (potentially differing) storage
datatypes across all three matrix operands, a fully independent computation precision, and an fourth “conjugation
29 These object codes were built using GNU gcc 5.4.0 on an Intel Xeon E3-1271 v3 (Haswell) workstation.
20
only” transposition parameter value, BLIS also supports three different storage formats (row, column, and general
storage) for each matrix operand. This would lead to 31,104 separate implementations. If two additional precisions
are supported—half-precision and quad-precision, for example—this number grows to 497,664.
Our takeaway from this analysis, and our past experiences, is that achieving a complete high-performance refer-
ence implementation for mixed-datatype gemm requires careful management of complexity in the implementation.
Complexity must be managed not only within interfaces (e.g., via object-based APIs) but also internally by allowing
feature “decision points” (e.g., typecasting during packing) to work together in sequence, rather than in duplication,
to enable the desired combinatoric space of functionality while collapsing its corresponding axial dimensions within
the source code. Otherwise, the solution quickly becomes unwieldy, if not hopelessly intractable, for its maintainers.
10.3 Thesis
It is our conjecture that interfaces to new functionality, such as the mixed-datatype gemm presented in this article,
should be accompanied by a corresponding high-performance reference implementation, and that the key to making
such a goal attainable is managing combinatoric software complexity. Aside from serving several obvious purposes—a
research proof-of-concept, a reference for other developers, a working solution upon which end-users can rely—the
reference implementation, once instantiated, confers another, less tangible benefit. Namely, it creates the initial
conditions in which a community can form and a tangible product around which its members can organize, collaborate,
and advance towards its shared objectives. Therefore, this approach ultimately benefits all stakeholders.
11 Software availability
The software referenced in this article may be found at the BLIS project page [4] on GitHub along with documentation,
examples, links to discussion forums, and other related resources.
12 Acknowledgments
We kindly thank Jeff R. Hammond and Devin A. Matthews for referring us to appropriate citations for various
applications and use cases that stand to benefit from implementing mixed-domain and mixed-precision matrix mul-
tiplication functionality. We also acknowledge Matthews for participating in and facilitating early discussions on the
topic of mixed-domain and mixed-precision computation. Also, we thank Marvell and Oracle for providing access to
the Marvell ThunderX2 CN9975 and Intel Xeon Platinum 8167M servers, respectively, on which performance data
for this article was gathered. Finally, we thank members of the Science of High-Performance Computing (SHPC)
group for their contributions throughout the research towards and drafting of this article.
This research was partially sponsored by grants from Oracle, Huawei, and the National Science Foundation
(Awards ACI-1550493 and ACI-1714091). Any opinions, findings and conclusions or recommendations expressed in
this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation
(NSF).
References
[1] E. Apra, E. J. Bylaska, W. A. de Jong, N. Govind, K. Kowalski, T. P. Straatsma, M. Valiev, H. J. J. van Dam,
D. Wang, T. L. Windus, J. Hammond, J. Autschbach, K. Bhaskaran-Nair, J. Brabec, K. Lopata, S. A. Fischer,
S. Krishnamoorthy, M. Jacquelin, W. Ma, M. Klemm, O. Villa, Y. Chen, V. Anisimov, F. Aquino, S. Hirata,
M. T. Hackler, V. Konjkov, D. Mejia-Rodriguez, T. Risthaus, M. Malagoli, A. Marenich, A. Otero de-la Roza,
J. Mullin, P. Nichols, R. Peverati, J. Pittner, Y. Zhao, P.-D. Fan, A. Fonari, M. J. Williamson, R. J. Harrison,
J. R. Rehr, M. Dupuis, D. Silverstein, D. M. A. Smith, J. Nieplocha, V. Tipparaju, M. Krishnan, B. E. Van
Kuiken, A. Vazquez-Mayagoitia, L. Jensen, M. Swart, Q. Wu, T. Van Voorhis, A. A. Auer, M. Nooijen, L. D.
Crosby, E. Brown, G. Cisneros, G. I. Fann, H. Fruchtl, J. Garza, K. Hirao, R. A. Kendall, J. A. Nichols,
K. Tsemekhman, K. Wolinski, J. Anchell, D. E. Bernholdt, P. Borowski, T. Clark, D. Clerc, H. Dachsel,
M. J. O. Deegan, K. Dyall, D. Elwood, E. Glendening, M. Gutowski, A. C. Hess, J. Jaffe, B. G. Johnson, J. Ju,
21
R. Kobayashi, R. Kutteh, Z. Lin, R. Littlefield, X. Long, B. Meng, T. Nakajima, S. Niu, L. Pollack, M. Rosing,
K. Glaesemann, G. Sandrone, M. Stave, H. Taylor, G. Thomas, J. H. van Lenthe, A. T. Wong, and Z. Zhang.
NWChem, a computational chemistry package for parallel computers, version 6.8, 2018.
[2] http://www.netlib.org/blas, 2019.
[3] Basic linear algebra subprograms technical forum standard. International Journal of High Performance Appli-
cations and Supercomputing, 16(1), Spring 2002.
[4] https://github.com/flame/blis, 2019.
[5] Patrick Bridges, Nathan Doss, William Gropp, Edward Karrels, Ewing Lusk, and Anthony Skjellum. User’s
Guide for mpich, a Portable Implementation of MPI. Argonne National Laboratory, Sept. 1995.
[6] Sonia Coriani, Ove Christiansen, Thomas Fransson, and Patrick Norman. Coupled-cluster response theory for
near-edge x-ray-absorption fine structure of atoms and molecules. Phys. Rev. A, 85:022507, Feb 2012.
[7] T. Daniel Crawford and Henry F. Schaefer. An introduction to coupled cluster theory for computational chemists.
In Kenny B. Lipkowitz and Donald B. Boyd, editors, Reviews in Computational Chemistry, pages 33–136. Wiley-
Blackwell, 2007.
[8] Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Iain Duff. A set of level 3 basic linear algebra
subprograms. ACM Trans. Math. Soft., 16(1):1–17, March 1990.
[9] Kazushige Goto and Robert van de Geijn. Anatomy of high-performance matrix multiplication. ACM Trans.
Math. Soft., 34(3: Article 12, 25 pages), May 2008.
[10] Kazushige Goto and Robert van de Geijn. High-performance implementation of the level-3 BLAS. ACM Trans.
Math. Soft., 35(1):4:1–4:14, July 2008b.
[11] Kazushige Goto and Robert A. van de Geijn. On reducing TLB misses in matrix multiplication. Technical
Report TR-02-55, Department of Computer Sciences, The University of Texas at Austin, November 2002.
[12] W. Gropp, E. Lusk, and A. Skjellum. Using MPI. The MIT Press, 1994.
[13] J. Huang, L. Rice, D. A. Matthews, and R. A. van d Geijn. Generating families of practical fast matrix
multiplication algorithms. In 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS),
pages 656–667, May 2017.
[14] Jianyu Huang, Tyler M. Smith, Greg M. Henry, and Robert A. van de Geijn. Strassen’s algorithm reloaded.
In Proceedings of the International Conference for High Performance Computing, Networking, Storage and
Analysis, SC ’16, pages 59:1–59:12, Piscataway, NJ, USA, 2016. IEEE Press.
[15] Thomas-C. Jagau, Ksenia B. Bravaya, and Anna I. Krylov. Extending quantum chemistry of bound states to
electronic resonances. Annual Review of Physical Chemistry, 68(1):525–553, 2017. PMID: 28463649.
[16] Daya S. Khudia, Protonu Basu, and Summer Deng. Open-sourcing FBGEMM for state-of-the-art server-side
inference. https://code.fb.com/ml-applications/fbgemm/, 2018.
[17] Kasper Kristensen, Joanna Kauczor, Thomas Kjrgaard, and Poul Jrgensen. Quasienergy formulation of damped
response theory. The Journal of Chemical Physics, 131(4):044112, 2009.
[18] Xiaoye S. Li, James W. Demmel, David H. Bailey, Greg Henry, Yozo Hida, Jimmy Iskandar, W. Kahan, Anil
Kapur, Michael C. Martin, Brandon J. Thompson, Teresa Tung, and Daniel J. Yoo. Design, implementation
and testing of extended and mixed precision BLAS. LAPACK Working Note 149, October 2000.
[19] Devin Matthews. High-performance tensor contraction without transposition. SIAM J. Sci. Comput., 40(1):C1–
C24, 2018.
22
[20] Marcel Nooijen and Jaap G. Snijders. Coupled cluster approach to the single-particle Green’s function. Inter-
national Journal of Quantum Chemistry, 44(S26):55–83, 1992.
[21] https://github.com/xianyi/OpenBLAS, 2019.
[22] Tyler M Smith, Robert van de Geijn, Mikhail Smelyanskiy, Jeff R Hammond, and Field G Van Zee. Anatomy
of high-performance many-threaded matrix multiplication. In Parallel and Distributed Processing Symposium,
2014 IEEE 28th International, pages 1049–1059. IEEE, 2014.
[23] Marc Snir, Steve W. Otto, Steven Huss-Lederman, David W. Walker, and Jack Dongarra. MPI: The Complete
Reference. The MIT Press, 1996.
[24] John F. Stanton. Why CCSD(T) works: a different perspective. Chemical Physics Letters, 281(1):130–134,
1997.
[25] M. Valiev, E.J. Bylaska, N. Govind, K. Kowalski, T.P. Straatsma, H.J.J. Van Dam, D. Wang, J. Nieplocha,
E. Apra, T.L. Windus, and W.A. de Jong. Nwchem: A comprehensive and scalable open-source solution for
large scale molecular simulations. Computer Physics Communications, 181(9):1477–1489, 2010.
[26] Field G. Van Zee. Implementing high-performance complex matrix multiplication via the 1m method. ACM
Trans. Math. Soft., 2018. submitted.
[27] Field G. Van Zee and Tyler M. Smith. Implementing high-performance complex matrix multiplication via the
3m and 4m methods. ACM Trans. Math. Soft., 44(1):7:1–7:36, June 2017.
[28] Field G. Van Zee and Robert A. van de Geijn. BLIS: A framework for rapidly instantiating BLAS functionality.
ACM Trans. Math. Soft., 41(3):14:1–14:33, June 2015.
[29] http://www.netlib.org/xblas, 2019.
[30] Chenhan D. Yu, Jianyu Huang, Woody Austin, Bo Xiao, and George Biros. Performance optimization for
the k-nearest neighbors kernel on x86 architectures. In Proceedings of the International Conference for High
Performance Computing, Networking, Storage and Analysis, SC ’15, pages 7:1–7:12, New York, NY, USA, 2015.
ACM.
23
Appendix A Complete performance results
This section contains complete performance results using the same hardware, implementations, and test parameters
discussed in Section 8. We report 128 performance graphs for each combination of sequential, single-socket, and dual-
socket execution on both the Marvell ThunderX2 and Intel Xeon Platinum, resulting in a total of 128× 3× 2 = 768
graphs.
Performance graphs are organized into one set for each mixed-domain case, with each set containing the 16
possible precision combinations within that case. The mixed-domain sets of graphs appear in pairs (top and bottom)
across Figures 11 to 34. Within a figure, graphs in the left and center-left columns report performance using a
computation precision of single precision while those in the right and center-right columns show performance using a
computation precision of double precision. Furthermore, the graphs are organized such that, for any given graph in
the single-precision computation subgroup, the graph located two spots to its right corresponds to the experiments
where all precisions are toggled (from single to double or vice versa).
Within the graphs for any given mixed-domain set, the legends are omitted from all except the top-right graph
within each computation precision subgroup—that is, the top graph in the center-left and right columns. As with
the graphs presented in Section 8, the “Reference” curves are listed as “Ref (?gemm)”, where the ? indicates one of
{s,d,c,z}. This added detail serves to remind the reader which datatype-specific variant of BLIS’s conventional
(datatype-homogeneous) gemm is the most appropriate curve against which to judge the “Internal” and “Ad-hoc”
mixed-datatype implementations. In general, the graphs in the left and right halves of Figures 11–34 (top and
bottom) use sgemm() and dgemm() as reference curves, respectively, except for the mixed-domain Case 3 results in
the bottom halves of Figures 11, 15, 19, 23, 27, and 31, which compare against cgemm() and zgemm()in the left and
right halves, respectively.
We provide the following table to aid the reader in finding the performance graphs associated with any given
mixed-domain case, for each hardware and threading configuration.
Hardware Threads Cases 0, 3 1a, 1b 2ab, 1c 2bc, 2ac
ThunderX2
1 Fig. 11 Fig. 12 Fig. 13 Fig. 14
28 Fig. 15 Fig. 16 Fig. 17 Fig. 18
56 Fig. 19 Fig. 20 Fig. 21 Fig. 22
Xeon Platinum
1 Fig. 23 Fig. 24 Fig. 25 Fig. 26
26 Fig. 27 Fig. 28 Fig. 29 Fig. 30
52 Fig. 31 Fig. 32 Fig. 33 Fig. 34
24
800 1600
0
10
20
30
G
FL
O
PS
ssss
800 1600
0
10
20
30
ssds
Ref (sgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
800 1600
0
5
10
15
dddd
800 1600
0
5
10
15
ddsd
Ref (dgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
800 1600
0
10
20
30
G
FL
O
PS
sdss
800 1600
0
10
20
30
sdds
800 1600
0
5
10
15
dsdd
800 1600
0
5
10
15
dssd
800 1600
0
10
20
30
G
FL
O
PS
dsss
800 1600
0
10
20
30
dsds
800 1600
0
5
10
15
sddd
800 1600
0
5
10
15
sdsd
800 1600
     m = n = k
0
10
20
30
G
FL
O
PS
ddss
800 1600
     m = n = k
0
10
20
30
ddds
800 1600
     m = n = k
0
5
10
15
ssdd
800 1600
     m = n = k
0
5
10
15
sssd
800 1600
0
10
20
30
G
FL
O
PS
cccs
800 1600
0
10
20
30
cczs
Ref (cgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
800 1600
0
5
10
15
zzzd
800 1600
0
5
10
15
zzcd
Ref (zgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
800 1600
0
10
20
30
G
FL
O
PS
czcs
800 1600
0
10
20
30
czzs
800 1600
0
5
10
15
zczd
800 1600
0
5
10
15
zccd
800 1600
0
10
20
30
G
FL
O
PS
zccs
800 1600
0
10
20
30
zczs
800 1600
0
5
10
15
czzd
800 1600
0
5
10
15
czcd
800 1600
     m = n = k
0
10
20
30
G
FL
O
PS
zzcs
800 1600
     m = n = k
0
10
20
30
zzzs
800 1600
     m = n = k
0
5
10
15
cczd
800 1600
     m = n = k
0
5
10
15
cccd
Figure 11: Sequential performance of “Internal” and “Ad-hoc” implementations of gemm for all precision combina-
tions within mixed-domain Cases 0 (top) and 3 (bottom) on a Marvell ThunderX2 CN9975 processor. The 16 graphs
on the left side and right sides report computation in single- and double-precision, respectively. The theoretical peak
performance coincides with the top of each graph.
25
800 1600
0
10
20
30
G
FL
O
PS
sscs
800 1600
0
10
20
30
sszs
Ref (sgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
800 1600
0
5
10
15
ddzd
800 1600
0
5
10
15
ddcd
Ref (dgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
800 1600
0
10
20
30
G
FL
O
PS
sdcs
800 1600
0
10
20
30
sdzs
800 1600
0
5
10
15
dszd
800 1600
0
5
10
15
dscd
800 1600
0
10
20
30
G
FL
O
PS
dscs
800 1600
0
10
20
30
dszs
800 1600
0
5
10
15
sdzd
800 1600
0
5
10
15
sdcd
800 1600
     m = n = k
0
10
20
30
G
FL
O
PS
ddcs
800 1600
     m = n = k
0
10
20
30
ddzs
800 1600
     m = n = k
0
5
10
15
sszd
800 1600
     m = n = k
0
5
10
15
sscd
800 1600
0
10
20
30
G
FL
O
PS
scss
800 1600
0
10
20
30
scds
Ref (sgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
800 1600
0
5
10
15
dzdd
800 1600
0
5
10
15
dzsd
Ref (dgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
800 1600
0
10
20
30
G
FL
O
PS
szss
800 1600
0
10
20
30
szds
800 1600
0
5
10
15
dcdd
800 1600
0
5
10
15
dcsd
800 1600
0
10
20
30
G
FL
O
PS
dcss
800 1600
0
10
20
30
dcds
800 1600
0
5
10
15
szdd
800 1600
0
5
10
15
szsd
800 1600
     m = n = k
0
10
20
30
G
FL
O
PS
dzss
800 1600
     m = n = k
0
10
20
30
dzds
800 1600
     m = n = k
0
5
10
15
scdd
800 1600
     m = n = k
0
5
10
15
scsd
Figure 12: Sequential performance of “Internal” and “Ad-hoc” implementations of gemm for all precision combi-
nations within mixed-domain Cases 1a (top) and 1b (bottom) on a Marvell ThunderX2 CN9975 processor. The
16 graphs on the left side and right sides report computation in single- and double-precision, respectively. The
theoretical peak performance coincides with the top of each graph.
26
800 1600
0
10
20
30
G
FL
O
PS
sccs
800 1600
0
10
20
30
sczs
Ref (sgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
800 1600
0
5
10
15
dzzd
800 1600
0
5
10
15
dzcd
Ref (dgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
800 1600
0
10
20
30
G
FL
O
PS
szcs
800 1600
0
10
20
30
szzs
800 1600
0
5
10
15
dczd
800 1600
0
5
10
15
dccd
800 1600
0
10
20
30
G
FL
O
PS
dccs
800 1600
0
10
20
30
dczs
800 1600
0
5
10
15
szzd
800 1600
0
5
10
15
szcd
800 1600
     m = n = k
0
10
20
30
G
FL
O
PS
dzcs
800 1600
     m = n = k
0
10
20
30
dzzs
800 1600
     m = n = k
0
5
10
15
sczd
800 1600
     m = n = k
0
5
10
15
sccd
800 1600
0
10
20
30
G
FL
O
PS
csss
800 1600
0
10
20
30
csds
Ref (sgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
800 1600
0
5
10
15
zddd
800 1600
0
5
10
15
zdsd
Ref (dgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
800 1600
0
10
20
30
G
FL
O
PS
cdss
800 1600
0
10
20
30
cdds
800 1600
0
5
10
15
zsdd
800 1600
0
5
10
15
zssd
800 1600
0
10
20
30
G
FL
O
PS
zsss
800 1600
0
10
20
30
zsds
800 1600
0
5
10
15
cddd
800 1600
0
5
10
15
cdsd
800 1600
     m = n = k
0
10
20
30
G
FL
O
PS
zdss
800 1600
     m = n = k
0
10
20
30
zdds
800 1600
     m = n = k
0
5
10
15
csdd
800 1600
     m = n = k
0
5
10
15
cssd
Figure 13: Sequential performance of “Internal” and “Ad-hoc” implementations of gemm for all precision combi-
nations within mixed-domain Cases 2ab (top) and 1c (bottom) on a Marvell ThunderX2 CN9975 processor. The
16 graphs on the left side and right sides report computation in single- and double-precision, respectively. The
theoretical peak performance coincides with the top of each graph.
27
800 1600
0
10
20
30
G
FL
O
PS
cscs
800 1600
0
10
20
30
cszs
Ref (sgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
800 1600
0
5
10
15
zdzd
800 1600
0
5
10
15
zdcd
Ref (dgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
800 1600
0
10
20
30
G
FL
O
PS
cdcs
800 1600
0
10
20
30
cdzs
800 1600
0
5
10
15
zszd
800 1600
0
5
10
15
zscd
800 1600
0
10
20
30
G
FL
O
PS
zscs
800 1600
0
10
20
30
zszs
800 1600
0
5
10
15
cdzd
800 1600
0
5
10
15
cdcd
800 1600
     m = n = k
0
10
20
30
G
FL
O
PS
zdcs
800 1600
     m = n = k
0
10
20
30
zdzs
800 1600
     m = n = k
0
5
10
15
cszd
800 1600
     m = n = k
0
5
10
15
cscd
800 1600
0
10
20
30
G
FL
O
PS
ccss
800 1600
0
10
20
30
ccds
Ref (sgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
800 1600
0
5
10
15
zzdd
800 1600
0
5
10
15
zzsd
Ref (dgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
800 1600
0
10
20
30
G
FL
O
PS
czss
800 1600
0
10
20
30
czds
800 1600
0
5
10
15
zcdd
800 1600
0
5
10
15
zcsd
800 1600
0
10
20
30
G
FL
O
PS
zcss
800 1600
0
10
20
30
zcds
800 1600
0
5
10
15
czdd
800 1600
0
5
10
15
czsd
800 1600
     m = n = k
0
10
20
30
G
FL
O
PS
zzss
800 1600
     m = n = k
0
10
20
30
zzds
800 1600
     m = n = k
0
5
10
15
ccdd
800 1600
     m = n = k
0
5
10
15
ccsd
Figure 14: Sequential performance of “Internal” and “Ad-hoc” implementations of gemm for all precision com-
binations within mixed-domain Cases 2bc (top) and 2ac (bottom) on a Marvell ThunderX2 CN9975 processor.
The 16 graphs on the left side and right sides report computation in single- and double-precision, respectively. The
theoretical peak performance coincides with the top of each graph.
28
2500 5000
0
10
20
30
G
FL
O
PS
/c
or
e
ssss
2500 5000
0
10
20
30
ssds
Ref (sgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
5
10
15
dddd
2500 5000
0
5
10
15
ddsd
Ref (dgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
10
20
30
G
FL
O
PS
/c
or
e
sdss
2500 5000
0
10
20
30
sdds
2500 5000
0
5
10
15
dsdd
2500 5000
0
5
10
15
dssd
2500 5000
0
10
20
30
G
FL
O
PS
/c
or
e
dsss
2500 5000
0
10
20
30
dsds
2500 5000
0
5
10
15
sddd
2500 5000
0
5
10
15
sdsd
2500 5000
     m = n = k
0
10
20
30
G
FL
O
PS
/c
or
e
ddss
2500 5000
     m = n = k
0
10
20
30
ddds
2500 5000
     m = n = k
0
5
10
15
ssdd
2500 5000
     m = n = k
0
5
10
15
sssd
2500 5000
0
10
20
30
G
FL
O
PS
/c
or
e
cccs
2500 5000
0
10
20
30
cczs
Ref (cgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
5
10
15
zzzd
2500 5000
0
5
10
15
zzcd
Ref (zgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
10
20
30
G
FL
O
PS
/c
or
e
czcs
2500 5000
0
10
20
30
czzs
2500 5000
0
5
10
15
zczd
2500 5000
0
5
10
15
zccd
2500 5000
0
10
20
30
G
FL
O
PS
/c
or
e
zccs
2500 5000
0
10
20
30
zczs
2500 5000
0
5
10
15
czzd
2500 5000
0
5
10
15
czcd
2500 5000
     m = n = k
0
10
20
30
G
FL
O
PS
/c
or
e
zzcs
2500 5000
     m = n = k
0
10
20
30
zzzs
2500 5000
     m = n = k
0
5
10
15
cczd
2500 5000
     m = n = k
0
5
10
15
cccd
Figure 15: Multithreaded (28 threads) performance of “Internal” and “Ad-hoc” implementations of gemm for all
precision combinations within mixed-domain Cases 0 (top) and 3 (bottom) on a Marvell ThunderX2 CN9975 proces-
sor. The 16 graphs on the left side and right sides report computation in single- and double-precision, respectively.
The theoretical peak performance coincides with the top of each graph.
29
2500 5000
0
10
20
30
G
FL
O
PS
/c
or
e
sscs
2500 5000
0
10
20
30
sszs
Ref (sgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
5
10
15
ddzd
2500 5000
0
5
10
15
ddcd
Ref (dgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
10
20
30
G
FL
O
PS
/c
or
e
sdcs
2500 5000
0
10
20
30
sdzs
2500 5000
0
5
10
15
dszd
2500 5000
0
5
10
15
dscd
2500 5000
0
10
20
30
G
FL
O
PS
/c
or
e
dscs
2500 5000
0
10
20
30
dszs
2500 5000
0
5
10
15
sdzd
2500 5000
0
5
10
15
sdcd
2500 5000
     m = n = k
0
10
20
30
G
FL
O
PS
/c
or
e
ddcs
2500 5000
     m = n = k
0
10
20
30
ddzs
2500 5000
     m = n = k
0
5
10
15
sszd
2500 5000
     m = n = k
0
5
10
15
sscd
2500 5000
0
10
20
30
G
FL
O
PS
/c
or
e
scss
2500 5000
0
10
20
30
scds
Ref (sgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
5
10
15
dzdd
2500 5000
0
5
10
15
dzsd
Ref (dgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
10
20
30
G
FL
O
PS
/c
or
e
szss
2500 5000
0
10
20
30
szds
2500 5000
0
5
10
15
dcdd
2500 5000
0
5
10
15
dcsd
2500 5000
0
10
20
30
G
FL
O
PS
/c
or
e
dcss
2500 5000
0
10
20
30
dcds
2500 5000
0
5
10
15
szdd
2500 5000
0
5
10
15
szsd
2500 5000
     m = n = k
0
10
20
30
G
FL
O
PS
/c
or
e
dzss
2500 5000
     m = n = k
0
10
20
30
dzds
2500 5000
     m = n = k
0
5
10
15
scdd
2500 5000
     m = n = k
0
5
10
15
scsd
Figure 16: Multithreaded (28 threads) performance of “Internal” and “Ad-hoc” implementations of gemm for all
precision combinations within mixed-domain Cases 1a (top) and 1b (bottom) on a Marvell ThunderX2 CN9975 pro-
cessor. The 16 graphs on the left side and right sides report computation in single- and double-precision, respectively.
The theoretical peak performance coincides with the top of each graph.
30
2500 5000
0
10
20
30
G
FL
O
PS
/c
or
e
sccs
2500 5000
0
10
20
30
sczs
Ref (sgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
5
10
15
dzzd
2500 5000
0
5
10
15
dzcd
Ref (dgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
10
20
30
G
FL
O
PS
/c
or
e
szcs
2500 5000
0
10
20
30
szzs
2500 5000
0
5
10
15
dczd
2500 5000
0
5
10
15
dccd
2500 5000
0
10
20
30
G
FL
O
PS
/c
or
e
dccs
2500 5000
0
10
20
30
dczs
2500 5000
0
5
10
15
szzd
2500 5000
0
5
10
15
szcd
2500 5000
     m = n = k
0
10
20
30
G
FL
O
PS
/c
or
e
dzcs
2500 5000
     m = n = k
0
10
20
30
dzzs
2500 5000
     m = n = k
0
5
10
15
sczd
2500 5000
     m = n = k
0
5
10
15
sccd
2500 5000
0
10
20
30
G
FL
O
PS
/c
or
e
csss
2500 5000
0
10
20
30
csds
Ref (sgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
5
10
15
zddd
2500 5000
0
5
10
15
zdsd
Ref (dgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
10
20
30
G
FL
O
PS
/c
or
e
cdss
2500 5000
0
10
20
30
cdds
2500 5000
0
5
10
15
zsdd
2500 5000
0
5
10
15
zssd
2500 5000
0
10
20
30
G
FL
O
PS
/c
or
e
zsss
2500 5000
0
10
20
30
zsds
2500 5000
0
5
10
15
cddd
2500 5000
0
5
10
15
cdsd
2500 5000
     m = n = k
0
10
20
30
G
FL
O
PS
/c
or
e
zdss
2500 5000
     m = n = k
0
10
20
30
zdds
2500 5000
     m = n = k
0
5
10
15
csdd
2500 5000
     m = n = k
0
5
10
15
cssd
Figure 17: Multithreaded (28 threads) performance of “Internal” and “Ad-hoc” implementations of gemm for all
precision combinations within mixed-domain Cases 2ab (top) and 1c (bottom) on a Marvell ThunderX2 CN9975 pro-
cessor. The 16 graphs on the left side and right sides report computation in single- and double-precision, respectively.
The theoretical peak performance coincides with the top of each graph.
31
2500 5000
0
10
20
30
G
FL
O
PS
/c
or
e
cscs
2500 5000
0
10
20
30
cszs
Ref (sgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
5
10
15
zdzd
2500 5000
0
5
10
15
zdcd
Ref (dgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
10
20
30
G
FL
O
PS
/c
or
e
cdcs
2500 5000
0
10
20
30
cdzs
2500 5000
0
5
10
15
zszd
2500 5000
0
5
10
15
zscd
2500 5000
0
10
20
30
G
FL
O
PS
/c
or
e
zscs
2500 5000
0
10
20
30
zszs
2500 5000
0
5
10
15
cdzd
2500 5000
0
5
10
15
cdcd
2500 5000
     m = n = k
0
10
20
30
G
FL
O
PS
/c
or
e
zdcs
2500 5000
     m = n = k
0
10
20
30
zdzs
2500 5000
     m = n = k
0
5
10
15
cszd
2500 5000
     m = n = k
0
5
10
15
cscd
2500 5000
0
10
20
30
G
FL
O
PS
/c
or
e
ccss
2500 5000
0
10
20
30
ccds
Ref (sgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
5
10
15
zzdd
2500 5000
0
5
10
15
zzsd
Ref (dgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
10
20
30
G
FL
O
PS
/c
or
e
czss
2500 5000
0
10
20
30
czds
2500 5000
0
5
10
15
zcdd
2500 5000
0
5
10
15
zcsd
2500 5000
0
10
20
30
G
FL
O
PS
/c
or
e
zcss
2500 5000
0
10
20
30
zcds
2500 5000
0
5
10
15
czdd
2500 5000
0
5
10
15
czsd
2500 5000
     m = n = k
0
10
20
30
G
FL
O
PS
/c
or
e
zzss
2500 5000
     m = n = k
0
10
20
30
zzds
2500 5000
     m = n = k
0
5
10
15
ccdd
2500 5000
     m = n = k
0
5
10
15
ccsd
Figure 18: Multithreaded (28 threads) performance of “Internal” and “Ad-hoc” implementations of gemm for all
precision combinations within mixed-domain Cases 2bc (top) and 2ac (bottom) on a Marvell ThunderX2 CN9975
processor. The 16 graphs on the left side and right sides report computation in single- and double-precision, respec-
tively. The theoretical peak performance coincides with the top of each graph.
32
2500 5000
0
10
20
30
G
FL
O
PS
/c
or
e
ssss
2500 5000
0
10
20
30
ssds
Ref (sgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
5
10
15
dddd
2500 5000
0
5
10
15
ddsd
Ref (dgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
10
20
30
G
FL
O
PS
/c
or
e
sdss
2500 5000
0
10
20
30
sdds
2500 5000
0
5
10
15
dsdd
2500 5000
0
5
10
15
dssd
2500 5000
0
10
20
30
G
FL
O
PS
/c
or
e
dsss
2500 5000
0
10
20
30
dsds
2500 5000
0
5
10
15
sddd
2500 5000
0
5
10
15
sdsd
2500 5000
     m = n = k
0
10
20
30
G
FL
O
PS
/c
or
e
ddss
2500 5000
     m = n = k
0
10
20
30
ddds
2500 5000
     m = n = k
0
5
10
15
ssdd
2500 5000
     m = n = k
0
5
10
15
sssd
2500 5000
0
10
20
30
G
FL
O
PS
/c
or
e
cccs
2500 5000
0
10
20
30
cczs
Ref (cgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
5
10
15
zzzd
2500 5000
0
5
10
15
zzcd
Ref (zgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
10
20
30
G
FL
O
PS
/c
or
e
czcs
2500 5000
0
10
20
30
czzs
2500 5000
0
5
10
15
zczd
2500 5000
0
5
10
15
zccd
2500 5000
0
10
20
30
G
FL
O
PS
/c
or
e
zccs
2500 5000
0
10
20
30
zczs
2500 5000
0
5
10
15
czzd
2500 5000
0
5
10
15
czcd
2500 5000
     m = n = k
0
10
20
30
G
FL
O
PS
/c
or
e
zzcs
2500 5000
     m = n = k
0
10
20
30
zzzs
2500 5000
     m = n = k
0
5
10
15
cczd
2500 5000
     m = n = k
0
5
10
15
cccd
Figure 19: Multithreaded (56 threads) performance of “Internal” and “Ad-hoc” implementations of gemm for all
precision combinations within mixed-domain Cases 0 (top) and 3 (bottom) on a Marvell ThunderX2 CN9975 proces-
sor. The 16 graphs on the left side and right sides report computation in single- and double-precision, respectively.
The theoretical peak performance coincides with the top of each graph.
33
2500 5000
0
10
20
30
G
FL
O
PS
/c
or
e
sscs
2500 5000
0
10
20
30
sszs
Ref (sgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
5
10
15
ddzd
2500 5000
0
5
10
15
ddcd
Ref (dgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
10
20
30
G
FL
O
PS
/c
or
e
sdcs
2500 5000
0
10
20
30
sdzs
2500 5000
0
5
10
15
dszd
2500 5000
0
5
10
15
dscd
2500 5000
0
10
20
30
G
FL
O
PS
/c
or
e
dscs
2500 5000
0
10
20
30
dszs
2500 5000
0
5
10
15
sdzd
2500 5000
0
5
10
15
sdcd
2500 5000
     m = n = k
0
10
20
30
G
FL
O
PS
/c
or
e
ddcs
2500 5000
     m = n = k
0
10
20
30
ddzs
2500 5000
     m = n = k
0
5
10
15
sszd
2500 5000
     m = n = k
0
5
10
15
sscd
2500 5000
0
10
20
30
G
FL
O
PS
/c
or
e
scss
2500 5000
0
10
20
30
scds
Ref (sgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
5
10
15
dzdd
2500 5000
0
5
10
15
dzsd
Ref (dgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
10
20
30
G
FL
O
PS
/c
or
e
szss
2500 5000
0
10
20
30
szds
2500 5000
0
5
10
15
dcdd
2500 5000
0
5
10
15
dcsd
2500 5000
0
10
20
30
G
FL
O
PS
/c
or
e
dcss
2500 5000
0
10
20
30
dcds
2500 5000
0
5
10
15
szdd
2500 5000
0
5
10
15
szsd
2500 5000
     m = n = k
0
10
20
30
G
FL
O
PS
/c
or
e
dzss
2500 5000
     m = n = k
0
10
20
30
dzds
2500 5000
     m = n = k
0
5
10
15
scdd
2500 5000
     m = n = k
0
5
10
15
scsd
Figure 20: Multithreaded (56 threads) performance of “Internal” and “Ad-hoc” implementations of gemm for all
precision combinations within mixed-domain Cases 1a (top) and 1b (bottom) on a Marvell ThunderX2 CN9975 pro-
cessor. The 16 graphs on the left side and right sides report computation in single- and double-precision, respectively.
The theoretical peak performance coincides with the top of each graph.
34
2500 5000
0
10
20
30
G
FL
O
PS
/c
or
e
sccs
2500 5000
0
10
20
30
sczs
Ref (sgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
5
10
15
dzzd
2500 5000
0
5
10
15
dzcd
Ref (dgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
10
20
30
G
FL
O
PS
/c
or
e
szcs
2500 5000
0
10
20
30
szzs
2500 5000
0
5
10
15
dczd
2500 5000
0
5
10
15
dccd
2500 5000
0
10
20
30
G
FL
O
PS
/c
or
e
dccs
2500 5000
0
10
20
30
dczs
2500 5000
0
5
10
15
szzd
2500 5000
0
5
10
15
szcd
2500 5000
     m = n = k
0
10
20
30
G
FL
O
PS
/c
or
e
dzcs
2500 5000
     m = n = k
0
10
20
30
dzzs
2500 5000
     m = n = k
0
5
10
15
sczd
2500 5000
     m = n = k
0
5
10
15
sccd
2500 5000
0
10
20
30
G
FL
O
PS
/c
or
e
csss
2500 5000
0
10
20
30
csds
Ref (sgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
5
10
15
zddd
2500 5000
0
5
10
15
zdsd
Ref (dgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
10
20
30
G
FL
O
PS
/c
or
e
cdss
2500 5000
0
10
20
30
cdds
2500 5000
0
5
10
15
zsdd
2500 5000
0
5
10
15
zssd
2500 5000
0
10
20
30
G
FL
O
PS
/c
or
e
zsss
2500 5000
0
10
20
30
zsds
2500 5000
0
5
10
15
cddd
2500 5000
0
5
10
15
cdsd
2500 5000
     m = n = k
0
10
20
30
G
FL
O
PS
/c
or
e
zdss
2500 5000
     m = n = k
0
10
20
30
zdds
2500 5000
     m = n = k
0
5
10
15
csdd
2500 5000
     m = n = k
0
5
10
15
cssd
Figure 21: Multithreaded (56 threads) performance of “Internal” and “Ad-hoc” implementations of gemm for all
precision combinations within mixed-domain Cases 2ab (top) and 1c (bottom) on a Marvell ThunderX2 CN9975 pro-
cessor. The 16 graphs on the left side and right sides report computation in single- and double-precision, respectively.
The theoretical peak performance coincides with the top of each graph.
35
2500 5000
0
10
20
30
G
FL
O
PS
/c
or
e
cscs
2500 5000
0
10
20
30
cszs
Ref (sgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
5
10
15
zdzd
2500 5000
0
5
10
15
zdcd
Ref (dgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
10
20
30
G
FL
O
PS
/c
or
e
cdcs
2500 5000
0
10
20
30
cdzs
2500 5000
0
5
10
15
zszd
2500 5000
0
5
10
15
zscd
2500 5000
0
10
20
30
G
FL
O
PS
/c
or
e
zscs
2500 5000
0
10
20
30
zszs
2500 5000
0
5
10
15
cdzd
2500 5000
0
5
10
15
cdcd
2500 5000
     m = n = k
0
10
20
30
G
FL
O
PS
/c
or
e
zdcs
2500 5000
     m = n = k
0
10
20
30
zdzs
2500 5000
     m = n = k
0
5
10
15
cszd
2500 5000
     m = n = k
0
5
10
15
cscd
2500 5000
0
10
20
30
G
FL
O
PS
/c
or
e
ccss
2500 5000
0
10
20
30
ccds
Ref (sgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
5
10
15
zzdd
2500 5000
0
5
10
15
zzsd
Ref (dgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
10
20
30
G
FL
O
PS
/c
or
e
czss
2500 5000
0
10
20
30
czds
2500 5000
0
5
10
15
zcdd
2500 5000
0
5
10
15
zcsd
2500 5000
0
10
20
30
G
FL
O
PS
/c
or
e
zcss
2500 5000
0
10
20
30
zcds
2500 5000
0
5
10
15
czdd
2500 5000
0
5
10
15
czsd
2500 5000
     m = n = k
0
10
20
30
G
FL
O
PS
/c
or
e
zzss
2500 5000
     m = n = k
0
10
20
30
zzds
2500 5000
     m = n = k
0
5
10
15
ccdd
2500 5000
     m = n = k
0
5
10
15
ccsd
Figure 22: Multithreaded (56 threads) performance of “Internal” and “Ad-hoc” implementations of gemm for all
precision combinations within mixed-domain Cases 2bc (top) and 2ac (bottom) on a Marvell ThunderX2 CN9975
processor. The 16 graphs on the left side and right sides report computation in single- and double-precision, respec-
tively. The theoretical peak performance coincides with the top of each graph.
36
800 1600
0
50
100
G
FL
O
PS
ssss
800 1600
0
50
100
ssds
Ref (sgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
800 1600
0
20
40
60
dddd
800 1600
0
20
40
60
ddsd
Ref (dgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
800 1600
0
50
100
G
FL
O
PS
sdss
800 1600
0
50
100
sdds
800 1600
0
20
40
60
dsdd
800 1600
0
20
40
60
dssd
800 1600
0
50
100
G
FL
O
PS
dsss
800 1600
0
50
100
dsds
800 1600
0
20
40
60
sddd
800 1600
0
20
40
60
sdsd
800 1600
     m = n = k
0
50
100
G
FL
O
PS
ddss
800 1600
     m = n = k
0
50
100
ddds
800 1600
     m = n = k
0
20
40
60
ssdd
800 1600
     m = n = k
0
20
40
60
sssd
800 1600
0
50
100
G
FL
O
PS
cccs
800 1600
0
50
100
cczs
Ref (cgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
800 1600
0
20
40
60
zzzd
800 1600
0
20
40
60
zzcd
Ref (zgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
800 1600
0
50
100
G
FL
O
PS
czcs
800 1600
0
50
100
czzs
800 1600
0
20
40
60
zczd
800 1600
0
20
40
60
zccd
800 1600
0
50
100
G
FL
O
PS
zccs
800 1600
0
50
100
zczs
800 1600
0
20
40
60
czzd
800 1600
0
20
40
60
czcd
800 1600
     m = n = k
0
50
100
G
FL
O
PS
zzcs
800 1600
     m = n = k
0
50
100
zzzs
800 1600
     m = n = k
0
20
40
60
cczd
800 1600
     m = n = k
0
20
40
60
cccd
Figure 23: Sequential performance of “Internal” and “Ad-hoc” implementations of gemm for all precision combina-
tions within mixed-domain Cases 0 (top) and 3 (bottom) on an Intel Xeon Platinum 8167M processor. The 16 graphs
on the left side and right sides report computation in single- and double-precision, respectively. The theoretical peak
performance coincides with the top of each graph.
37
800 1600
0
50
100
G
FL
O
PS
sscs
800 1600
0
50
100
sszs
Ref (sgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
800 1600
0
20
40
60
ddzd
800 1600
0
20
40
60
ddcd
Ref (dgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
800 1600
0
50
100
G
FL
O
PS
sdcs
800 1600
0
50
100
sdzs
800 1600
0
20
40
60
dszd
800 1600
0
20
40
60
dscd
800 1600
0
50
100
G
FL
O
PS
dscs
800 1600
0
50
100
dszs
800 1600
0
20
40
60
sdzd
800 1600
0
20
40
60
sdcd
800 1600
     m = n = k
0
50
100
G
FL
O
PS
ddcs
800 1600
     m = n = k
0
50
100
ddzs
800 1600
     m = n = k
0
20
40
60
sszd
800 1600
     m = n = k
0
20
40
60
sscd
800 1600
0
50
100
G
FL
O
PS
scss
800 1600
0
50
100
scds
Ref (sgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
800 1600
0
20
40
60
dzdd
800 1600
0
20
40
60
dzsd
Ref (dgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
800 1600
0
50
100
G
FL
O
PS
szss
800 1600
0
50
100
szds
800 1600
0
20
40
60
dcdd
800 1600
0
20
40
60
dcsd
800 1600
0
50
100
G
FL
O
PS
dcss
800 1600
0
50
100
dcds
800 1600
0
20
40
60
szdd
800 1600
0
20
40
60
szsd
800 1600
     m = n = k
0
50
100
G
FL
O
PS
dzss
800 1600
     m = n = k
0
50
100
dzds
800 1600
     m = n = k
0
20
40
60
scdd
800 1600
     m = n = k
0
20
40
60
scsd
Figure 24: Sequential performance of “Internal” and “Ad-hoc” implementations of gemm for all precision combi-
nations within mixed-domain Cases 1a (top) and 1b (bottom) on an Intel Xeon Platinum 8167M processor. The
16 graphs on the left side and right sides report computation in single- and double-precision, respectively. The
theoretical peak performance coincides with the top of each graph.
38
800 1600
0
50
100
G
FL
O
PS
sccs
800 1600
0
50
100
sczs
Ref (sgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
800 1600
0
20
40
60
dzzd
800 1600
0
20
40
60
dzcd
Ref (dgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
800 1600
0
50
100
G
FL
O
PS
szcs
800 1600
0
50
100
szzs
800 1600
0
20
40
60
dczd
800 1600
0
20
40
60
dccd
800 1600
0
50
100
G
FL
O
PS
dccs
800 1600
0
50
100
dczs
800 1600
0
20
40
60
szzd
800 1600
0
20
40
60
szcd
800 1600
     m = n = k
0
50
100
G
FL
O
PS
dzcs
800 1600
     m = n = k
0
50
100
dzzs
800 1600
     m = n = k
0
20
40
60
sczd
800 1600
     m = n = k
0
20
40
60
sccd
800 1600
0
50
100
G
FL
O
PS
csss
800 1600
0
50
100
csds
Ref (sgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
800 1600
0
20
40
60
zddd
800 1600
0
20
40
60
zdsd
Ref (dgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
800 1600
0
50
100
G
FL
O
PS
cdss
800 1600
0
50
100
cdds
800 1600
0
20
40
60
zsdd
800 1600
0
20
40
60
zssd
800 1600
0
50
100
G
FL
O
PS
zsss
800 1600
0
50
100
zsds
800 1600
0
20
40
60
cddd
800 1600
0
20
40
60
cdsd
800 1600
     m = n = k
0
50
100
G
FL
O
PS
zdss
800 1600
     m = n = k
0
50
100
zdds
800 1600
     m = n = k
0
20
40
60
csdd
800 1600
     m = n = k
0
20
40
60
cssd
Figure 25: Sequential performance of “Internal” and “Ad-hoc” implementations of gemm for all precision combi-
nations within mixed-domain Cases 2ab (top) and 1c (bottom) on an Intel Xeon Platinum 8167M processor. The
16 graphs on the left side and right sides report computation in single- and double-precision, respectively. The
theoretical peak performance coincides with the top of each graph.
39
800 1600
0
50
100
G
FL
O
PS
cscs
800 1600
0
50
100
cszs
Ref (sgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
800 1600
0
20
40
60
zdzd
800 1600
0
20
40
60
zdcd
Ref (dgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
800 1600
0
50
100
G
FL
O
PS
cdcs
800 1600
0
50
100
cdzs
800 1600
0
20
40
60
zszd
800 1600
0
20
40
60
zscd
800 1600
0
50
100
G
FL
O
PS
zscs
800 1600
0
50
100
zszs
800 1600
0
20
40
60
cdzd
800 1600
0
20
40
60
cdcd
800 1600
     m = n = k
0
50
100
G
FL
O
PS
zdcs
800 1600
     m = n = k
0
50
100
zdzs
800 1600
     m = n = k
0
20
40
60
cszd
800 1600
     m = n = k
0
20
40
60
cscd
800 1600
0
50
100
G
FL
O
PS
ccss
800 1600
0
50
100
ccds
Ref (sgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
800 1600
0
20
40
60
zzdd
800 1600
0
20
40
60
zzsd
Ref (dgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
800 1600
0
50
100
G
FL
O
PS
czss
800 1600
0
50
100
czds
800 1600
0
20
40
60
zcdd
800 1600
0
20
40
60
zcsd
800 1600
0
50
100
G
FL
O
PS
zcss
800 1600
0
50
100
zcds
800 1600
0
20
40
60
czdd
800 1600
0
20
40
60
czsd
800 1600
     m = n = k
0
50
100
G
FL
O
PS
zzss
800 1600
     m = n = k
0
50
100
zzds
800 1600
     m = n = k
0
20
40
60
ccdd
800 1600
     m = n = k
0
20
40
60
ccsd
Figure 26: Sequential performance of “Internal” and “Ad-hoc” implementations of gemm for all precision com-
binations within mixed-domain Cases 2bc (top) and 2ac (bottom) on an Intel Xeon Platinum 8167M processor.
The 16 graphs on the left side and right sides report computation in single- and double-precision, respectively. The
theoretical peak performance coincides with the top of each graph.
40
2500 5000
0
50
100
G
FL
O
PS
/c
or
e
ssss
2500 5000
0
50
100
ssds
Ref (sgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
20
40
60
dddd
2500 5000
0
20
40
60
ddsd
Ref (dgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
50
100
G
FL
O
PS
/c
or
e
sdss
2500 5000
0
50
100
sdds
2500 5000
0
20
40
60
dsdd
2500 5000
0
20
40
60
dssd
2500 5000
0
50
100
G
FL
O
PS
/c
or
e
dsss
2500 5000
0
50
100
dsds
2500 5000
0
20
40
60
sddd
2500 5000
0
20
40
60
sdsd
2500 5000
     m = n = k
0
50
100
G
FL
O
PS
/c
or
e
ddss
2500 5000
     m = n = k
0
50
100
ddds
2500 5000
     m = n = k
0
20
40
60
ssdd
2500 5000
     m = n = k
0
20
40
60
sssd
2500 5000
0
50
100
G
FL
O
PS
/c
or
e
cccs
2500 5000
0
50
100
cczs
Ref (cgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
20
40
60
zzzd
2500 5000
0
20
40
60
zzcd
Ref (zgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
50
100
G
FL
O
PS
/c
or
e
czcs
2500 5000
0
50
100
czzs
2500 5000
0
20
40
60
zczd
2500 5000
0
20
40
60
zccd
2500 5000
0
50
100
G
FL
O
PS
/c
or
e
zccs
2500 5000
0
50
100
zczs
2500 5000
0
20
40
60
czzd
2500 5000
0
20
40
60
czcd
2500 5000
     m = n = k
0
50
100
G
FL
O
PS
/c
or
e
zzcs
2500 5000
     m = n = k
0
50
100
zzzs
2500 5000
     m = n = k
0
20
40
60
cczd
2500 5000
     m = n = k
0
20
40
60
cccd
Figure 27: Multithreaded (26 threads) performance of “Internal” and “Ad-hoc” implementations of gemm for all
precision combinations within mixed-domain Cases 0 (top) and 3 (bottom) on an Intel Xeon Platinum 8167M pro-
cessor. The 16 graphs on the left side and right sides report computation in single- and double-precision, respectively.
The theoretical peak performance coincides with the top of each graph.
41
2500 5000
0
50
100
G
FL
O
PS
/c
or
e
sscs
2500 5000
0
50
100
sszs
Ref (sgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
20
40
60
ddzd
2500 5000
0
20
40
60
ddcd
Ref (dgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
50
100
G
FL
O
PS
/c
or
e
sdcs
2500 5000
0
50
100
sdzs
2500 5000
0
20
40
60
dszd
2500 5000
0
20
40
60
dscd
2500 5000
0
50
100
G
FL
O
PS
/c
or
e
dscs
2500 5000
0
50
100
dszs
2500 5000
0
20
40
60
sdzd
2500 5000
0
20
40
60
sdcd
2500 5000
     m = n = k
0
50
100
G
FL
O
PS
/c
or
e
ddcs
2500 5000
     m = n = k
0
50
100
ddzs
2500 5000
     m = n = k
0
20
40
60
sszd
2500 5000
     m = n = k
0
20
40
60
sscd
2500 5000
0
50
100
G
FL
O
PS
/c
or
e
scss
2500 5000
0
50
100
scds
Ref (sgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
20
40
60
dzdd
2500 5000
0
20
40
60
dzsd
Ref (dgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
50
100
G
FL
O
PS
/c
or
e
szss
2500 5000
0
50
100
szds
2500 5000
0
20
40
60
dcdd
2500 5000
0
20
40
60
dcsd
2500 5000
0
50
100
G
FL
O
PS
/c
or
e
dcss
2500 5000
0
50
100
dcds
2500 5000
0
20
40
60
szdd
2500 5000
0
20
40
60
szsd
2500 5000
     m = n = k
0
50
100
G
FL
O
PS
/c
or
e
dzss
2500 5000
     m = n = k
0
50
100
dzds
2500 5000
     m = n = k
0
20
40
60
scdd
2500 5000
     m = n = k
0
20
40
60
scsd
Figure 28: Multithreaded (26 threads) performance of “Internal” and “Ad-hoc” implementations of gemm for all
precision combinations within mixed-domain Cases 1a (top) and 1b (bottom) on an Intel Xeon Platinum 8167M pro-
cessor. The 16 graphs on the left side and right sides report computation in single- and double-precision, respectively.
The theoretical peak performance coincides with the top of each graph.
42
2500 5000
0
50
100
G
FL
O
PS
/c
or
e
sccs
2500 5000
0
50
100
sczs
Ref (sgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
20
40
60
dzzd
2500 5000
0
20
40
60
dzcd
Ref (dgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
50
100
G
FL
O
PS
/c
or
e
szcs
2500 5000
0
50
100
szzs
2500 5000
0
20
40
60
dczd
2500 5000
0
20
40
60
dccd
2500 5000
0
50
100
G
FL
O
PS
/c
or
e
dccs
2500 5000
0
50
100
dczs
2500 5000
0
20
40
60
szzd
2500 5000
0
20
40
60
szcd
2500 5000
     m = n = k
0
50
100
G
FL
O
PS
/c
or
e
dzcs
2500 5000
     m = n = k
0
50
100
dzzs
2500 5000
     m = n = k
0
20
40
60
sczd
2500 5000
     m = n = k
0
20
40
60
sccd
2500 5000
0
50
100
G
FL
O
PS
/c
or
e
csss
2500 5000
0
50
100
csds
Ref (sgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
20
40
60
zddd
2500 5000
0
20
40
60
zdsd
Ref (dgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
50
100
G
FL
O
PS
/c
or
e
cdss
2500 5000
0
50
100
cdds
2500 5000
0
20
40
60
zsdd
2500 5000
0
20
40
60
zssd
2500 5000
0
50
100
G
FL
O
PS
/c
or
e
zsss
2500 5000
0
50
100
zsds
2500 5000
0
20
40
60
cddd
2500 5000
0
20
40
60
cdsd
2500 5000
     m = n = k
0
50
100
G
FL
O
PS
/c
or
e
zdss
2500 5000
     m = n = k
0
50
100
zdds
2500 5000
     m = n = k
0
20
40
60
csdd
2500 5000
     m = n = k
0
20
40
60
cssd
Figure 29: Multithreaded (26 threads) performance of “Internal” and “Ad-hoc” implementations of gemm for
all precision combinations within mixed-domain Cases 2ab (top) and 1c (bottom) on an Intel Xeon Platinum
8167M processor. The 16 graphs on the left side and right sides report computation in single- and double-precision,
respectively. The theoretical peak performance coincides with the top of each graph.
43
2500 5000
0
50
100
G
FL
O
PS
/c
or
e
cscs
2500 5000
0
50
100
cszs
Ref (sgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
20
40
60
zdzd
2500 5000
0
20
40
60
zdcd
Ref (dgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
50
100
G
FL
O
PS
/c
or
e
cdcs
2500 5000
0
50
100
cdzs
2500 5000
0
20
40
60
zszd
2500 5000
0
20
40
60
zscd
2500 5000
0
50
100
G
FL
O
PS
/c
or
e
zscs
2500 5000
0
50
100
zszs
2500 5000
0
20
40
60
cdzd
2500 5000
0
20
40
60
cdcd
2500 5000
     m = n = k
0
50
100
G
FL
O
PS
/c
or
e
zdcs
2500 5000
     m = n = k
0
50
100
zdzs
2500 5000
     m = n = k
0
20
40
60
cszd
2500 5000
     m = n = k
0
20
40
60
cscd
2500 5000
0
50
100
G
FL
O
PS
/c
or
e
ccss
2500 5000
0
50
100
ccds
Ref (sgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
20
40
60
zzdd
2500 5000
0
20
40
60
zzsd
Ref (dgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
50
100
G
FL
O
PS
/c
or
e
czss
2500 5000
0
50
100
czds
2500 5000
0
20
40
60
zcdd
2500 5000
0
20
40
60
zcsd
2500 5000
0
50
100
G
FL
O
PS
/c
or
e
zcss
2500 5000
0
50
100
zcds
2500 5000
0
20
40
60
czdd
2500 5000
0
20
40
60
czsd
2500 5000
     m = n = k
0
50
100
G
FL
O
PS
/c
or
e
zzss
2500 5000
     m = n = k
0
50
100
zzds
2500 5000
     m = n = k
0
20
40
60
ccdd
2500 5000
     m = n = k
0
20
40
60
ccsd
Figure 30: Multithreaded (26 threads) performance of “Internal” and “Ad-hoc” implementations of gemm for
all precision combinations within mixed-domain Cases 2bc (top) and 2ac (bottom) on an Intel Xeon Platinum
8167M processor. The 16 graphs on the left side and right sides report computation in single- and double-precision,
respectively. The theoretical peak performance coincides with the top of each graph.
44
2500 5000
0
50
100
G
FL
O
PS
/c
or
e
ssss
2500 5000
0
50
100
ssds
Ref (sgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
20
40
60
dddd
2500 5000
0
20
40
60
ddsd
Ref (dgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
50
100
G
FL
O
PS
/c
or
e
sdss
2500 5000
0
50
100
sdds
2500 5000
0
20
40
60
dsdd
2500 5000
0
20
40
60
dssd
2500 5000
0
50
100
G
FL
O
PS
/c
or
e
dsss
2500 5000
0
50
100
dsds
2500 5000
0
20
40
60
sddd
2500 5000
0
20
40
60
sdsd
2500 5000
     m = n = k
0
50
100
G
FL
O
PS
/c
or
e
ddss
2500 5000
     m = n = k
0
50
100
ddds
2500 5000
     m = n = k
0
20
40
60
ssdd
2500 5000
     m = n = k
0
20
40
60
sssd
2500 5000
0
50
100
G
FL
O
PS
/c
or
e
cccs
2500 5000
0
50
100
cczs
Ref (cgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
20
40
60
zzzd
2500 5000
0
20
40
60
zzcd
Ref (zgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
50
100
G
FL
O
PS
/c
or
e
czcs
2500 5000
0
50
100
czzs
2500 5000
0
20
40
60
zczd
2500 5000
0
20
40
60
zccd
2500 5000
0
50
100
G
FL
O
PS
/c
or
e
zccs
2500 5000
0
50
100
zczs
2500 5000
0
20
40
60
czzd
2500 5000
0
20
40
60
czcd
2500 5000
     m = n = k
0
50
100
G
FL
O
PS
/c
or
e
zzcs
2500 5000
     m = n = k
0
50
100
zzzs
2500 5000
     m = n = k
0
20
40
60
cczd
2500 5000
     m = n = k
0
20
40
60
cccd
Figure 31: Multithreaded (52 threads) performance of “Internal” and “Ad-hoc” implementations of gemm for all
precision combinations within mixed-domain Cases 0 (top) and 3 (bottom) on an Intel Xeon Platinum 8167M pro-
cessor. The 16 graphs on the left side and right sides report computation in single- and double-precision, respectively.
The theoretical peak performance coincides with the top of each graph.
45
2500 5000
0
50
100
G
FL
O
PS
/c
or
e
sscs
2500 5000
0
50
100
sszs
Ref (sgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
20
40
60
ddzd
2500 5000
0
20
40
60
ddcd
Ref (dgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
50
100
G
FL
O
PS
/c
or
e
sdcs
2500 5000
0
50
100
sdzs
2500 5000
0
20
40
60
dszd
2500 5000
0
20
40
60
dscd
2500 5000
0
50
100
G
FL
O
PS
/c
or
e
dscs
2500 5000
0
50
100
dszs
2500 5000
0
20
40
60
sdzd
2500 5000
0
20
40
60
sdcd
2500 5000
     m = n = k
0
50
100
G
FL
O
PS
/c
or
e
ddcs
2500 5000
     m = n = k
0
50
100
ddzs
2500 5000
     m = n = k
0
20
40
60
sszd
2500 5000
     m = n = k
0
20
40
60
sscd
2500 5000
0
50
100
G
FL
O
PS
/c
or
e
scss
2500 5000
0
50
100
scds
Ref (sgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
20
40
60
dzdd
2500 5000
0
20
40
60
dzsd
Ref (dgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
50
100
G
FL
O
PS
/c
or
e
szss
2500 5000
0
50
100
szds
2500 5000
0
20
40
60
dcdd
2500 5000
0
20
40
60
dcsd
2500 5000
0
50
100
G
FL
O
PS
/c
or
e
dcss
2500 5000
0
50
100
dcds
2500 5000
0
20
40
60
szdd
2500 5000
0
20
40
60
szsd
2500 5000
     m = n = k
0
50
100
G
FL
O
PS
/c
or
e
dzss
2500 5000
     m = n = k
0
50
100
dzds
2500 5000
     m = n = k
0
20
40
60
scdd
2500 5000
     m = n = k
0
20
40
60
scsd
Figure 32: Multithreaded (52 threads) performance of “Internal” and “Ad-hoc” implementations of gemm for all
precision combinations within mixed-domain Cases 1a (top) and 1b (bottom) on an Intel Xeon Platinum 8167M pro-
cessor. The 16 graphs on the left side and right sides report computation in single- and double-precision, respectively.
The theoretical peak performance coincides with the top of each graph.
46
2500 5000
0
50
100
G
FL
O
PS
/c
or
e
sccs
2500 5000
0
50
100
sczs
Ref (sgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
20
40
60
dzzd
2500 5000
0
20
40
60
dzcd
Ref (dgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
50
100
G
FL
O
PS
/c
or
e
szcs
2500 5000
0
50
100
szzs
2500 5000
0
20
40
60
dczd
2500 5000
0
20
40
60
dccd
2500 5000
0
50
100
G
FL
O
PS
/c
or
e
dccs
2500 5000
0
50
100
dczs
2500 5000
0
20
40
60
szzd
2500 5000
0
20
40
60
szcd
2500 5000
     m = n = k
0
50
100
G
FL
O
PS
/c
or
e
dzcs
2500 5000
     m = n = k
0
50
100
dzzs
2500 5000
     m = n = k
0
20
40
60
sczd
2500 5000
     m = n = k
0
20
40
60
sccd
2500 5000
0
50
100
G
FL
O
PS
/c
or
e
csss
2500 5000
0
50
100
csds
Ref (sgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
20
40
60
zddd
2500 5000
0
20
40
60
zdsd
Ref (dgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
50
100
G
FL
O
PS
/c
or
e
cdss
2500 5000
0
50
100
cdds
2500 5000
0
20
40
60
zsdd
2500 5000
0
20
40
60
zssd
2500 5000
0
50
100
G
FL
O
PS
/c
or
e
zsss
2500 5000
0
50
100
zsds
2500 5000
0
20
40
60
cddd
2500 5000
0
20
40
60
cdsd
2500 5000
     m = n = k
0
50
100
G
FL
O
PS
/c
or
e
zdss
2500 5000
     m = n = k
0
50
100
zdds
2500 5000
     m = n = k
0
20
40
60
csdd
2500 5000
     m = n = k
0
20
40
60
cssd
Figure 33: Multithreaded (52 threads) performance of “Internal” and “Ad-hoc” implementations of gemm for
all precision combinations within mixed-domain Cases 2ab (top) and 1c (bottom) on an Intel Xeon Platinum
8167M processor. The 16 graphs on the left side and right sides report computation in single- and double-precision,
respectively. The theoretical peak performance coincides with the top of each graph.
47
2500 5000
0
50
100
G
FL
O
PS
/c
or
e
cscs
2500 5000
0
50
100
cszs
Ref (sgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
20
40
60
zdzd
2500 5000
0
20
40
60
zdcd
Ref (dgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
50
100
G
FL
O
PS
/c
or
e
cdcs
2500 5000
0
50
100
cdzs
2500 5000
0
20
40
60
zszd
2500 5000
0
20
40
60
zscd
2500 5000
0
50
100
G
FL
O
PS
/c
or
e
zscs
2500 5000
0
50
100
zszs
2500 5000
0
20
40
60
cdzd
2500 5000
0
20
40
60
cdcd
2500 5000
     m = n = k
0
50
100
G
FL
O
PS
/c
or
e
zdcs
2500 5000
     m = n = k
0
50
100
zdzs
2500 5000
     m = n = k
0
20
40
60
cszd
2500 5000
     m = n = k
0
20
40
60
cscd
2500 5000
0
50
100
G
FL
O
PS
/c
or
e
ccss
2500 5000
0
50
100
ccds
Ref (sgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
20
40
60
zzdd
2500 5000
0
20
40
60
zzsd
Ref (dgemm)
Intern (+xm)
Intern (-xm)
Ad-hoc
2500 5000
0
50
100
G
FL
O
PS
/c
or
e
czss
2500 5000
0
50
100
czds
2500 5000
0
20
40
60
zcdd
2500 5000
0
20
40
60
zcsd
2500 5000
0
50
100
G
FL
O
PS
/c
or
e
zcss
2500 5000
0
50
100
zcds
2500 5000
0
20
40
60
czdd
2500 5000
0
20
40
60
czsd
2500 5000
     m = n = k
0
50
100
G
FL
O
PS
/c
or
e
zzss
2500 5000
     m = n = k
0
50
100
zzds
2500 5000
     m = n = k
0
20
40
60
ccdd
2500 5000
     m = n = k
0
20
40
60
ccsd
Figure 34: Multithreaded (52 threads) performance of “Internal” and “Ad-hoc” implementations of gemm for
all precision combinations within mixed-domain Cases 2bc (top) and 2ac (bottom) on an Intel Xeon Platinum
8167M processor. The 16 graphs on the left side and right sides report computation in single- and double-precision,
respectively. The theoretical peak performance coincides with the top of each graph.
48
