Many applications, such as PDE based simulations and machine learning, apply / routines to large groups of small matrices. While existing batched APIs provide meaningful speedup for this problem type, a non-canonical data layout enabling crossmatrix vectorization may provide further signi cant speedup. In this paper, we propose a new compact data layout that interleaves matrices in blocks according to the SIMD vector length. We combine this compact data layout with a new interface to / routines that can be used within a hierarchical parallel application. Our layout provides up to 14×, 45×, and 27× speedup against OpenMP loops around optimized , and kernels, respectively, on the Intel Knights Landing architecture. We discuss the compact batched / implementations in two libraries, KokkosKernels and Intel Math Kernel Library. We demonstrate the APIs in a line solver for coupled PDEs. Finally, we present detailed performance analysis of our kernels.
INTRODUCTION
Dense linear algebra subroutines have a long history of standardization [3, 9, 16] , performance optimization [10, 11] and use in applications. While these standards have been foundational in multiple generations of high performance computing, application and architectural changes today require new designs [8, 27] . Several applications, such as PDE based simulations and machine learning, This paper is authored by an employee(s) of the United States Government and is in the public domain. Non-exclusive copying or redistribution is allowed, provided that the article citation is given and the authors and agency are clearly identi ed as its source. SC17, Denver, CO, USA 2017. 978-1-4503-5114-0/17/11. . . $15.00 DOI: 10.1145/3126908.3126941 rely on a large number of linear algebra operations ( / ) applied to very small matrices. The evolution of hardware to allow massive parallelism and increasing vector lengths also impacts the implementation of foundational linear algebra subroutines.
For groups of small matrix problems the community has developed batched approaches, which enhance performance by introducing parallelism over the individual / operations. However, existing batched / implementations based on column or row major data layouts have limited performance for small matrix sizes. For very small sizes there is simply too little data to take full advantage of the SIMD vector length in modern processors. In this work we consider a SIMD-friendly data layout which can take full advantage of the long SIMD length in modern processors for groups of small / operations through cross-matrix vectorization. Performance optimization of such kernels, especially for a large number of small problems, depends on a number of design choices. The key performance optimizations that we explore in this work are data layouts, vectorization, and cache-friendly interface design. To motivate the methods developed in this paper, we approach the problem from the perspective of an entire application: a line solver for coupled PDEs. The impact on the application greatly in uences our design choices. Speci cally, while the community focus on batched kernels has been around xed or variable sized interfaces for batched kernels [8, 18] , GPU implementations [1, 2, 6] , or group based interfaces [29] , we introduce a two-level interface that results in better cache-locality when composing multiple linear algebra kernels. The result is a set of highly e cient, vector-friendly / kernels for small matrices typical in applications. The proposed data layout, two-level interface and implementation is called compact batched / throughout the paper for brevity.
Contributions: The primary contribution of this paper is the introduction of new / kernels based on the compact data layout. The proposed compact / kernels are implemented in two libraries i.e., KokkosKernels 1 and Intel Math Kernel Library (MKL) targeting architecture speci c improvements. While the focus of batched / to date has been on performance, we extend this to the more complex routines and . On Intel Knights Landing architecture (KNL), our implementations of compact general matrix multiplication ( ), triangular matrix solves ( ), and LU factorization with no pivoting ( ), result in up to 14×, 45×, and 27× speedup against OpenMP loops around highly optimized , and kernels, respectively. We also demonstrate the e ciency of the compact batched / kernels by using them in a line solver for coupled PDEs. The compact batched routines provide 2×-6× speedup. Finally, we perform detailed analysis for vector utilization and arithmetic intensity of the kernels based on a customized Intel PIN tool (APEX) [12] .
The rest of the paper is organized as follows. The motivating applications are described rst in Section 2. The compact batched / kernels are described in Section 3. We then demonstrate the strengths of our design with performance results of the kernels (Section 4.1) and the line solver application (Section 4.2). Vector utilization and arithmetic intensity measures are used to analyze the performance using a roo ine model in Section 4.3.
MOTIVATING APPLICATIONS
A fundamental distributed data structure in HPC is a block sparse matrix, implemented in compressed row/column storage (CSR/CSC) formats. A CSR or CSC graph encodes dependencies between blocks, i.e., the graph. Blocks may be one xed size throughout the matrix, or have variable size. In this paper we consider one xed block size b×b. A higher-level data structure can be used to build matrices having variable block sizes from matrices having xed block size [7] . Each block is typically dense.
PDE-based simulations form one class of HPC application using block matrices. The discretization is over a mesh. A block is associated with a mesh entity, such as a node, edge, face, or cell center. The block size is the number of degrees of freedom associated with the mesh entity. For example, a 3D compressible uid dynamics model using the ideal gas model and with all degrees of freedom at the cell center has 5×5 blocks. As another example, a 3D solid mechanics nodal nite element model using the linear elasticity material model has 3×3 blocks. In simulation codes modeling complicated physical phenomena, b can be several tens [21] .
Compared with a point sparse matrix, where a point can be understood as corresponding to a 1×1 block or, in other words, a scalar, a block sparse matrix uses b 2 fewer ordinals to encode the graph. In addition, computations within and between a block and a vector do not require indexing except to compute the o sets to the block quantities. Computations within a block sparse matrix (e.g., an incomplete factorization), between two matrices (e.g., a matrixmatrix multiplication), and between a matrix and a multivector (e.g., matrix-vector product, triangular solve) use many small 1, 2, and 3 subroutine calls. The structure of the computation determines the extent to which these calls may occur in parallel.
In a typical PDE-based application, a block sparse matrix is lled with discretization coe cient values. Then a preconditioner is formed as a function of the matrix. Finally, an iterative linear solver performs a sequence of matrix-vector products and preconditioner applications. One commonly used preconditioner is the line smoother or solver. It arises in a simulation in two or more dimensions in which independent equations are solved along one dimension, either as an approximation within a preconditioner, because of decoupling of time or space scales, or because of a mix of implicit and explicit time integration. For example, Tuminaro et al. [23] solve independent equations in the vertical direction of an ice sheet as the smoother in an algebraic multigrid preconditioner. U. of Minnesota's US3D [4] , NASA's DPLR [28] , and Sandia National Laboratories' SPARC use a line smoother in a xed-point iteration [28] to solve the Navier-Stokes equations for compressible and reacting ow. Lines are formed with the intention that they be approximately orthogonal to the shocks that form in the simulation. Nonhydrostatic atmosphere solvers use the horizontally explicit, vertically implicit (HEVI) time integration method to remove the vertical acoustic wave speed from the time step restriction [25] . Segall et al. [20] solve for pressure and temperature orthogonal to a fault, with no coupling along fault.
In each of these applications, a large number of independent block-tridiagonal matrices are formed. Operations on and with the block-tridiagonal matrices may be performed in parallel. We refer to this kind of parallel work as batch parallelism. Because of the typical topologies of the meshes, the block-tridiagonal matrices often have the same size. There are a number of divide-and-conquer methods to expose parallelism within an operation on or with a single tridiagonal matrix, such as cyclic reduction [22] and pre x product [19] . Each method recursively forms independent smaller problems; the recursion depth can be adjusted. Each method is slightly work ine cient. Thus, if the number of block-tridiagonal matrices times the amount of parallelism within a single block operation is at least a few times greater than the available hardware parallelism, it is optimal to exploit only batch parallelism. If the number is less than the available hardware parallelism, then algorithmic methods can be used with a recursion depth that uses batch parallelism maximally. This paper focuses only on batch parallelism.
COMPACT BLAS/LAPACK
Traditional implementations based on the conventional data layout (either column major or row major dense matrices) have limited performance for small problem sizes. With small matrix sizes there is too little data to take advantage of all of the vector registers and the data is too small to ll the vector registers that are used, resulting in limited bene t from vectorization. For a single operation, performance can be improved through the use of kernels speci cally tuned for the problem size. For example, one can use Just-in-Time (JIT) code-generation [13] . For this can be very e ective in improving performance. It remains to be seen if this approach can be bene cial for a broader set of more complicated or functions. Additionally, a JIT strategy for generating problem-size tuned kernels still does not address the fundamental problem of vector register ll for small problems.
When there are many matrix operations to be performed simultaneously we can consider alternative data layouts that allow the application to bene t from kernel vectorization. The compact data layout, which is the subject of this work, is a SIMD-friendly layout with considerable advantages in performance for groups of many small matrix operations. 
Compact Data Layout
To illustrate the compact layout consider a collection of V·P matrices A, each having the same size, with which we need to perform some operation. Here P is a positive integer and V is the SIMD vector length of the underlying hardware; e.g., for the Intel Advanced Vector Extensions 512 (Intel AVX-512) architecture in double precision V = 8. We identify element (i, j) of matrix m by A(m, i, j).
The compact layout is most easily understood as a modi ed 3D tensor. First, consider a collection of V·P matrices as a 3D tensor. We also use the term workset size or N for V·P. Organizing the data such that m is the fastest index makes vectorization natural even for a very small matrix size. We can replace a scalar operation, e.g., a multiply and add operation,
by a vector operation,
where C(m:m+V-1, i, j) refers to the (i,j) th element of V matrices stored contiguously in memory. Here += and × are applied elementwise. The vector registers can be lled completely if V is equal to the vector register length (SIMD width).
Notice, however, that as P grows large the distance in memory between elements of an individual matrix grows large, decreasing data locality. To remedy this the compact layout organizes the matrices in a packed data structure (packs) whose length is given by the vector length V. Speci cally, the packs A(nV:(n+1)V-1, :, :), n ∈ {0, . . . , P-1}, are each individually organized as 3D tensors, again with the matrix number m as the fastest index. We illustrate the layout for four 3×3 matrices with V=2 in Figure 1 . Pack n+1 is stored subsequently to pack n in memory, for each n. This layout lls SIMD vectors for each instruction in our / kernel while minimizing the distance in memory between elements of the same matrix.
Algorithm 1:
Simpli ed no-transpose, no-transpose 
Implementation
To focus on the details related to the compact layout we will compare kernels written for simpli ed , , and operations. In particular we will ignore matrix strides and scaling operations, as these do not a ect the basic strategy of optimizing the inner kernel for compact / routines. To avoid confusion, we separate notation for matrices A, B and C stored in column major layout from a collection of matrices in compact layout, Ac, Bc and Cc. All matrices have real, double precision. We begin by considering the operation
where × denotes matrix-matrix multiplication. The simpli ed notranspose, no-transpose matrix multiplication is given in Algorithm 1. Methods for optimizing and other level 3 operations for large matrix sizes are well understood [10, 11] . It is outside the scope of this paper to review all of the techniques involved in obtaining high performance for large s. Instead, we brie y review the design of a kernel for a 3×3×3 on a KNL. First we load columns of C and A into vector registers. For the 3×3×3 problem, the entire A and C matrices can be loaded into registers. Figure 2 illustrates the resulting ll of the vector registers. In a no-transpose, no-transpose matrix multiplication, columns of A are scaled by elements of B, and the results are added to columns of C. So, we will not work with vectors of B, but rather we will crawl over B, broadcast elements to vectors and then perform vector fused multiply-adds (FMAs). A nice feature of the FMA instruction in the Intel AVX-512 architecture is the ability to take a memory address as an operand; the FMA instruction performs an implicit broadcast. Thus we do not need to explicitly broadcast elements of B into registers before performing FMAs.
Notice in Figure 2 that every vector instruction will need to be masked since the A and C registers are only partially lled with 3 out of 8 packed elements for an Intel AVX-512 machine in double precision (or worse, 3 out of 16 packed elements in single precision). There are two obvious performance limitations here. First, our Algorithm 2: Compact kernel for a single pack
Algorithm 3: Simpli ed left, lower, no-transpose, non-unit diagonal kernel
theoretical peak is limited to 3/8ths of the core's peak simply due to the low vector ll. Second, masked FMAs have a higher latency than non-masked FMAs. The gure illustrates that a fundamental problem with very small operations is that we don't have enough data to make full use of the CPU core.
We turn our attention now to the collection of matrices stored in compact format and consider the reference compact algorithm given in Algorithm 2, where we isolate a single pack of V matrices for clarity. The key di erence is that in the inner-most loop we are performing the same operation on the same indices of the Ac, Bc and Cc matrices, changing only the matrix index. Since we have stored the matrices in packs of length V within which the matrix number is the fastest index, we can use only vector instructions -loads, stores and FMAs -with no masks. Additionally, the broadcast of B elements is replaced by simple loads since we are performing abstractly scalar operations.
The story is even more dramatic for . Recall that our simpli ed operation solves the equation
for X, where A is a lower triangular M×M matrix and X and B are M×N matrices. overwrites B with the solution X. For matrices stored in standard column major format, this operation is given in Algorithm 3. We see that there are two main components of the operation: a divide step to solve for element B(i,j) at the top of the i-loop, and then a forward substitution step. A typical algorithmic strategy for optimizing this operation for large sizes is to (i) thread over the j-index and (ii) block over the i and ii loops so that the forward substitution can be written as a call. For large sizes the cost of the i and ii blocked that occurs before the ensuing call is relatively small compared to the performance of the large
, and an optimized library can obtain performance that is within 80% of performance for the same sizes. The blocking of the i and ii indexes can be sized to ensure the in the substitution is properly aligned to memory boundaries. However, for very small problems we cannot bene t from the performance of a -based forward substitution. To make matters worse there is absolutely no opportunity for vectorization of the initial divides. Further, the divides are followed by substitutions which cannot be aligned properly to memory boundaries since the beginning of the forward substitution increments with each i index increment.
If we work on our V·P matrices in compact format, we can improve upon this situation dramatically. Consider the reference Algorithm 4: Reference compact left, lower, no-transpose, nonunit diagonal kernel for a single pack
Algorithm 5: Reference compact no pivoting kernel for a single pack
algorithm presented in Algorithm 4, again isolating a single pack for clarity. Notice that we can again replace every operation with a vector instruction, including the division. Also, memory boundaries are determined by the location of the rst element of a pack of matrices and the pack length, resulting in aligned operations regardless of the i or ii index.
Similar to , the bene ts of the compact data layout are considerably more pronounced for than for . However, introduces a new complication: pivoting. In a standard batched / implementation, where kernels work on individual matrix operations, there is no need to consider any di erent pivoting strategy as a standard kernel may be used within the batch parallel environment. However, the primary bene t of a compact layout is the application of exactly the same operation on V matrix problems simultaneously, thus lling the SIMD vectors and using vector instructions. Thus considerable data movement would be required to apply a pivoting LU kernel in a compact layout, as matrices would need to be packed according to their pivoting strategies. To avoid this issue, we consider only non-pivoting LU in this work.
The reference compact with no pivoting is given in Algorithm 5. In an optimized implementation we can replace every operation with a vector instruction and guarantee these vector operations are performed on data that is appropriately aligned to memory boundaries.
3.2.1 Open Source Implementation. KokkosKernels is built on top of the Kokkos library [5] providing performance-portable implementations in di erent architectures. We limit our experiments to Intel Knights Landing architecture in this paper. Following the parallel abstractions in Kokkos, our computational kernels support the following interfaces:
-serial or vector level -a single thread is used in the kernel; -team -a team of threads are cooperatively used in parallel; -procedure -the entire execution space is used in parallel.
In this work, we focus on the implementation of the serial interface that is used in parallel_for in the broader application such as a line solver, and we follow Kokkos' parallel loop scheduling and thread mapping.
SIMD requires aligned data which is packed contiguously along the vector length. Our compact data layout allows aligned memory access, but this might lead our implementation to be hardware speci c. To make our code portable, we use a template vector data type encapsulating vector registers with arithmetic operator overloading. This allows us to reuse scalar / algorithms with the vector data type, which also means that a good scalar code is more likely to be a good vectorized code. We follow the practice described in [14, 24] . Their core idea is to use a highly optimized architecture speci c micro kernel around the loop packing and blocking data for the kernel according to the architecture cache hierarchy. Implemented using the vector data type, the kernel fully exploits SIMD instructions.
Vendor Library. Intel MKL 2018 introduces compact batched
, , and non-pivoting . Intel MKL's initial implementation uses the compact layout and techniques described earlier in this section, loop unrolling, and compiler intrinsics to achieve performant kernels.
PERFORMANCE
We present performance results on the second generation Intel Xeon Phi 7250, code-named Knights Landing (KNL). The processor consists of 34 tiles interconnected by a two-dimensional mesh. Each tile comprises two four-way threaded cores running at 1.40GHz with 1MB of shared L2 cache. The processor core of KNL has a private 32KB L1 data cache and two AVX512 vector units per core. The processor is equipped with 16GB of MCDRAM that provides approximately 480 GB/s of STREAM Triad bandwidth. All benchmark codes used in this section are compiled using the Intel compiler 17.0.1 with -O3 -g options. First we evaluate the performance of the compact data layout and implementations in synthetic benchmarks against OpenMP loops around / kernels, the Intel MKL's batched / [15] , and libxsmm [13] . We then consider a line preconditioner application.
Batched BLAS/LAPACK
In this section we evaluate the performance of the compact batched / implementations for , and by comparing the performance to the standard batched implementation (where available) as well as the use of a parallel_for around the respective function. This section provides a view into the performance of the individual / functions that will be evaluated in the context of block tridiagonal factorization in the next section.
For each function, we evaluate the performance for square matrices with sizes 3, 5, 10, and 15, with a batch size of 16384. These sizes were chosen in collaboration with domain experts for our motivating application. We explain details of the selected batch sizes from the application context in Section 4.2.
Figures 3 and 6 evaluate the performance of the compact layout for . In Figure 3 
Figures 4 and 7 evaluate the performance of the compact layout for
. In Figure 4 we compare the KokkosKernels and MKL Compact with the MKL Batched (MKL Batch) and an OpenMP loop around MKL (MKL OpenMP) calls. In this case we see very large improvements for the compact implementations for all sizes and all core counts. MKL Compact is 12.89×, 7.82×, 4.84×, and 6.33× fast than MKL OpenMP for increasing block sizes, respectively. In Figure 7 we present a heatmap showing the speedups of the MKL Batched and the MKL Compact routines over an OpenMP loop around MKL calls with 68 threads. While MKL's standard batched approach provides 1.2-1.6× improvements over the OpenMP strategy, we see up to 45.2× improvements with the compact layout.
Finally, Figures 5 and 8 evaluate the performance of the compact layout for with no pivoting. In Figure 5 we compare our two compact implementations with an OpenMP loop around MKL (MKL OpenMP) calls. In this case we do not compare against a standard batched implementation for as it is not available in the MKL batched implementation. Similar to the case we see extremely large speedups, up to 10.49×, for all sizes and core counts for compact . In Figure 8 we present a heatmap showing the speedup of the MKL Compact implementation over an OpenMP loop around MKL calls with 68 threads. We see that the compact layout provides 1.7-27.4× improvements for the function depending on matrix and batch sizes. In general, the speedups of the compact implementations signi cantly increase with smaller block sizes. This shows the power of the compact layout to allow kernels to utilize computational units even on such small block sizes. These speedups become even more visible on larger batch sizes, which reduce the costs related to kernel launch overheads.
Line Preconditioner
We consider a block sparse system of equations Ax = b arising from coupled PDEs. The problem is discretized on a domain depicted in Figure 9 and lines are extracted along the k dimension. The standard stationary iterative procedure is applied for preconditioning the problem by splitting A = M − S, where M consists of block tridiagonal matrices corresponding to the extracted lines of elements. At the solution, Ax = b; hence The second line suggests the iteration
A xed number of these iterations may be used as a preconditioner in a Krylov subspace method, such as GMRES or CG. In either approach, M is factorized once per solution of Ax = b, and its factorization is applied multiple times. There are algorithmic variants for parallel tridiagonal solvers mostly based on the divide-andconquer methods [22] . In this study, we do not consider the parallel tridiagonal factorization but use a sequential algorithm solving many tridiagonal systems within parallel_for. Block tridiagonal matrices are extracted and packed in the compact data layout from the block sparse matrix A as illustrated in Figure 10 . The gure shows packing for a vector length of 2 for clarity, where α T 0 00 (α T 1 00 ) is the rst entry in theÂ r block of T 0 (T 1 ). Then, an LU factorization is applied to those block tridiagonal matrices according to Algorithm 6 with batch parallelism. There are two important aspects of this batched tridiagonal factorization. As the blocks are dense, the performance of this setup phase largely depends on e cient use of level 3 / functions. In particular, we evaluate the code on the small block sizes n b = 3, 5, 10 and 15. These small block sizes are typical in scienti c applications, as we previously described. Additionally, the batched block tridiagonal factorization requires application of a sequence of / operations along a block tridiagonal matrix. The standard batched / interface signi cantly limits the performance as it loses data locality after a single parallel batched operation sweeps over blocks. Unlike the standard batch interface, we expose the short and packed batch interface aligned to the hardware vector length; thus, we can fuse packed batch kernels in a sequence within a single parallel loop. It provides building blocks for us to e ciently compose new batched functions that can be used in parallel_for as in Algorithm 6. Thread-bound data between subsequent compact kernel calls will be better utilized by embedding sequential, vector-level, kernels within the broader application. Figure 11 shows the performance of the implementations using the compact layout compared against the hand-tuned reference implementation on the Intel KNL. As compact and are not yet available in the current MKL implementation, we do not evaluate the MKL for the solve phase. It is worth noting that we use the MCDRAM on the KNL as cache. In modern software design, application codes use modules to improve software productivity and our line preconditioner is also provided as a part of a solver module. In this context, relatively small MCDRAM is shared with other components and using MCDRAM as cache is the most plausible testing environment. We also assume that applications decompose the problem domain so that each computing node can hold a block sparse matrix tting into the fast memory. Thus, we use a 128×128× 128 mesh for blocks n b = 3, 5 and a 64 × 64 × 128 mesh is used for blocks n b = 10, 15. This setup generates problems ranging between 6.3 million and 7.9 million unknowns on each node. Preconditioning the problems, 16384 and 4096 block tridiagonal systems are formed and solved for n b = 3, 5 and n b = 10, 15 respectively.
We compare the vectorized implementations of the preconditioner based on the compact data layout against a line solver from SPARC, a massively parallel computational uid dynamics code developed by Sandia National Laboratories. SPARC has a straightforward implementation of the line smoother. It does not use a compact layout, and it relies on the compiler to vectorize loops. It uses template specializations for block sizes of primary interest. In this study, we have used specializations for n b = 5, 15 but not for n b = 3, 10.
As shown in the gure, substantial performance improvements -6.03×, 2.23×, 11.42× and 5.8× speed-up for block sizes 3, 5, 10 and 15, respectively, in the factorization step -are obtained by vectorizing the code with the compact data layout compared with hand-tuned version of the code with the conventional data layout.
In particular, our code shows signi cant speed-up for small block matrices, which are considered di cult to solve e ciently using conventional optimization techniques. Furthermore, there are signi cant speed-ups due to the vector-level kernel interface enabling multiple small batch calls in a single parallel loop when compared to the standard batched interfaces. Our fused implementation achieves ∼1.4× median speed-up over the implementation using the standard batched / interface on a compact layout over all the test instances. KokkosKernels also provides vector-level packed kernel interface, demonstrating better speed-ups when compared to the native solves.
Performance Analysis
To better understand the performance of the various kernel implementations on the Intel KNL we use the recently developed an application characterization tool (APEX) described in [12] . APEX is a customized Intel PIN [17] tool which performs dynamic instruction, memory operation, arithmetic, control ow and logical operation analysis on executing multi-threaded binaries to extract low-level performance characteristics and operation counts. Several aspects of the Intel KNL core and the AVX512 vector units make instruction analysis of executing applications particularly challenging. These include: (1) the ability of the vector units to mask out lanes when performing memory, arithmetic and logical operations, thus a ecting operation counting; (2) the signi cant increase in the capabilities of the AVX512 vector units in terms of operations and datatype support; (3) memory gather/scatter operations and, nally, (4) the increased parallelism available, placing pressure on any analysis tool to scale to signi cantly higher thread counts than previous Xeon-based processor designs. It is worth noting for the reader that the publicly exposed performance counters available on KNL cores cannot provide the level of detail exposed in APEX or the level of detail required for the analysis described here.
The APEX toolkit is optimized in several ways to provide accurate operation counting at high speed to permit scaling to long duration executions and the size of application binaries used in the production computing environment at Sandia National Laboratories (typically hundreds of megabytes to gigabytes in size). The instrumentation of application instructions is performed in two potential modes. The rst performs analysis of basic blocks within the application, counting instructions which have no masking properties. We call these static operation counts as they will always execute the same number of operations when the instruction is executed. Each basic block is instrumented so that on entry it atomically increments the number of times it has been executed permitting tallies to be maintained quickly and across threads. The second class of instruction presents a greater challenge; these are instructions for which masking operations are used. For these operations an additional handler is installed prior to each instruction execution which traps the masking register/value being used and then performs a population count over the mask to count the number of active entries. We call these dynamic operations as the number of operations is a data dependent property and can change on each execution of the instruction. Complex arithmetic operations such as fused-multiply-adds/subtracts etc. are associated with multipliers to ensure the nal operation tallies match the expectation of the application programmers. The toolkit can be considered to provide an optimistic approach to operation counting as there are possible uses of vector operations that can provide mask-equivalent operations that are not as easily tracked, but, in practice, we have found good correlation to algorithmic and hardware counters (where comparison is applicable) for a number of test cases that we have used from practical and contrived code examples during the tool's development. For the purposes of this analysis we provide a clear distinction between arithmetic oating-point operations from those which still utilize the vector units but do not perform mathematical operations (such as vector comparisons, logical operations and load-/store operations); instead, we count each of these classes separately to ensure a more accurate representation of application behavior. An important aspect to the operation counts supplied here is that they are as instrumented and perceived by the executing processor and are the subject of code generation and optimization, thus, di erences between programmer estimates and nal pro le-based operation and instruction tallies are not uncommon once inlining, unrolling and vectorization (or the lack of vectorization) take place.
Using the APEX analysis tool, we have been able to capture a broad range of low-level application behavior metrics. By using a subset of these, we have been able to capture the average vector utilization of di erent kernels and formulate Roo ine Model [26] diagrams of kernel behavior (see Figure 12 ) to show how the various kernel implementations utilize the hardware resources of the Intel KNL processor. Table 1 gives the average vector lane utilization per oating-point arithmetic instruction for the various implementations of , , and . For the purposes of this analysis we de ne the vector utilization metric as the summation of all oating point arithmetic operations (including those which execute as scalar (i.e. they execute in the 0th lane of the vector unit, as well as with and without masking applied) divided by the number of instructions which execute oating point arithmetic. While we include legacy X87-based instructions in this count, these are virtually never generated by modern Intel compilers and so can be considered to be either zero or an insigni cant part of the operation count. Each kernel can have a maximum utilization of at most 16 double precision oating point operations per arithmetic instruction which would result from 8 double-precision vector lanes and the ability to perform a fused-multiply-add on each lane (giving 2 operations per vector lane per instruction). MKL Compact and KokkosKernels achieve the best vector utilization overall. The vector utilization of all methods is higher with increasing block size as there is an increase in operands available to pack into each vector instruction. While all methods achieve decent vector utilization for , the compact kernels achieve close to 16 for all sizes, beating the other methods. However, utilization for the non-compact methods is much lower for and , consistent with our earlier analysis, while the compact layout allows the use of full vector operations for these more complicated functions. As a result, the performance di erence for and is larger, as shown in Figure 4 and Figure 5 . These results mostly correlate with the performance achieved, with the exception of libxsmm, which is able to provide strong performance at relatively lower levels of vector intensity. Although it has lower vector utilization, it has higher arithmetic intensity (indicating reduced load/store operations and higher register use) as described by the kernel's roo ine model analysis.
The roo ine performance model diagrams of the three kernels are shown in Figure 12 when using 68 threads and N = 16384. In order to generate the arithmetic intensity used for these plots we pro le all oating-point arithmetic operations/instructions (also required for the vector utilization metric above), as well as all load-/store/gather/scatter operations. Although the toolkit tracks data movement between registers, we explicitly exclude this in our arithmetic intensity calculation since we are interested in the bandwidth requirements and data movement across the processor. Movements between vector registers are exceptionally fast versus loads/stores even from local data caches to the point that we regard them as free in the context of investigating broader hardware performance and bottlenecks. The rst conclusion to note is the strong correlation with MCDRAM performance for most smaller kernel executions which cluster either at or close to the MCDRAM bandwidth limit. Increasing block sizes help to push the kernel performance over the MCDRAM bandwidth line and closer to the L2 bandwidth limit as the increased number of operands permits higher vector utilization (permitting more e cient load/stores), and, allows the compiler and processor to keep a greater number of loaded values in registers or cache, thereby reducing cache/memory accesses and access times for each block, increasing achieved performance. For all kernels shown, the use of compact memory layouts (MKL Compact and KokkosKernels) provides the highest performance which we attribute to the greater levels of e ciency resulting from full vector utilization, as well as, operation independence (since each function is performed on independent operands in each vector lane), reducing the overheads of managing kernel execution and allowing for very e cient load/stores of operands at full vector-widths. We argue that these e ects combine to reduce the total number of instructions required to compute the kernels over all operands and provide the processor with a much more e cient instruction stream that presents itself as higher achieved performance.
Although we have gone to considerable lengths in our code design and the discussion in this paper to convince the reader of our increased use of SIMD operations (as re ected in Table 1 ), the Roo ine models may be interpreted as counter to this discussion. The reality is that the roo ine limits (shown as blue lines/labels in our plots) show the peak the hardware is capable of in terms of instructions per second and clock rates in the absence of other micro-architectural bottlenecks or instruction mixes which contain non-oating point arithmetic operations or register/operand dependencies. The result of such e ects is shown in our roo ine plots on memkind (version 20160811) A.2.5 Datasets. Our tests generate small block matrices of different block sizes and work set sizes. These matrices are used in the testing of our kernels. Our preconditioner tests generate blockdiagonal matrices where block-diagonal is a tridiagonal matrix with small blocks for each entry in it. We use this matrix to evaluate the preconditioner creation and application. Data layout is an important factor in tests such as these. As our applications can switch to compact layouts with a small change (a template parameter) it is customary for them to store the data in a format that is best suitable for performance. Our tests store the input in the compact layout. Even if there was an application where it is not possible to store the data in compact layouts, the cost of allocation and copying into the compact data layout can be amortized over several preconditioner creation and solves. We avoid such expensive reformatting and store the data in compact layouts.
A.3 Installation
Installation of the open-source implementation of compact batched / uses a simple Make le system. Users can provide the path to Kokkos installation and get the batched / libraries built. We build the Kokkos library with the con guration options "-with-openmp -with-serial -arch=KNL -with-options= aggressive_vectorization". All our code is compiled with -O3 -g options to the Intel C++ compiler. The vendor library is provided in binary form and linked to our tests.
A.4 Evaluation and expected result
The tests output the average and maximum GFLOPS/sec for 100 iterations of the kernel. The preconditioner tests output the time for factorization and solves. We present the number of factorizations/solve per second in Figure 11 so that the di erences between competing methods can be easily seen. The GFLOPS is a simple analytical conversion for the number of matrices and block size parameters for a given number of factorization or solves.
A.5 Experiment customization
All the experiments in the paper uses a technique called "cold cache". In each run, we ush the small matrices out of the cache by allocating and initializing a large dataset that ushes the cache. This is very conservative estimate of the performance of our kernels. In typical usage such as our line preconditioner the data resides in cache for di erent kernels due to dependencies. The standard way to run our tests still uses the cold cache mode. However, we provide an option to evaluate the kernels in "hot cache" mode. This option demonstrates the improved performance that can be achieved if the data is reused between di erent kernels. For example, when using cold cache for each iteration kernel with 68 threads will result in 55.3 GFLOPs for block size of 5 and work set size of 16384 (See Figure 3) . Users interested in hot cache approach could run our tests with "-hot-cache" and evaluate the improved performance. For example, kernel with 68 threads will result in 113.8 GFLOPs for block size of 5 and work set size of 16384 with hot cache. The true GFLOPs for an application reusing the data between di erent kernels will typically be bounded by the hot and cold cache numbers. We report the conservative estimate in the paper, but provide an option to generate the other case.
