Batched dense linear algebra kernels are becoming ubiquitous in scientific applications, ranging from tensor contractions in deep learning to data compression in hierarchical low-rank matrix approximation. Within a single API call, these kernels are capable of simultaneously launching up to thousands of similar matrix computations, removing the expensive overhead of multiple API calls while increasing the occupancy of the underlying hardware. A challenge is that for the existing hardware landscape (x86, GPUs, etc.), only a subset of the required batched operations is implemented by the vendors, with limited support for very small problem sizes. We describe the design and performance of a new class of batched triangular dense linear algebra kernels on very small data sizes (up to 256) using single and multiple GPUs. By deploying recursive formulations, stressing the register usage, maintaining data locality, reducing threads synchronization, and fusing successive kernel calls, the new batched kernels outperform existing state-of-the-art implementations.
INTRODUCTION
The Basic Linear Algebra Subprograms (BLAS) (Dongarra et al. 1988 ) and the Linear Algebra Package (LAPACK) (Anderson et al. 1999) libraries have been the foundation of the high-performance computing (HPC) software stack chain for decades. In fact, most of today's scientific applications rely on a highly optimized BLAS implementation, often provisioned by vendor chip manufacturers, to extract performance from the underlying processing units. These architecture-dependent libraries (e.g., available in the Intel MKL (Intel 2017) for x86 or the NVIDIA cuBLAS/cuSOLVER libraries (NVIDIA 2017a (NVIDIA , 2017c for GPUs) may allow applications developers to achieve close to bandwidth or floating-point performance sustained peak for memory-bound or compute-bound kernel workloads, respectively, across various hardware systems.
We target some of the Level 3 BLAS triangular operations (symmetric rank-k update, triangular matrix-matrix multiplication, and solve) as well as some triangular matrix computations from LAPACK (Cholesky factorization, corresponding linear solvers, and matrix inversions). Some of our batched LAPACK functions reuse internally our batched BLAS kernels in a multi-dimensional recursion fashion and create further optimization opportunities by fusing the batched BLAS inner kernels for better performance. Our new implementations of single and nested batched kernels outperform existing state-of-the-art open source and commercial implementations on GPUs.
The remainder of the article is organized as follows. Section 2 presents related work. Section 3 outlines the operations currently supported in the KBLAS library. Section 4 recalls the challenges of designing triangular BLAS and LAPACK operations. Section 5 describes the fundamental algorithmic techniques to extract performance when dealing with a batch of matrix operations of very small sizes. The implementation details of the various batched GPU kernels are given in Section 6. Section 7 provides the performance results of various herein introduced batched triangular dense linear algebra operations on GPUs and compares them against existing state-of-the-art implementations. We conclude in Section 8.
RELATED WORK
The need for batched operations on BLAS routines-as well as on LAPACK routines-has been under thorough investigation lately . The effectiveness of batched operations has been demonstrated in several applications (Abdelfattah et al. 2016a; King et al. 2014 ). Both vendors and academic institutions have contributed batched operations in their library recent releases. For instance, NVIDIA and Intel provide only a subset of batched operations in cuBLAS (NVIDIA 2017a) and the MKL (Intel 2017 )/LIBXSMM (Heinecke et al. 2016 ) libraries, respectively, while MAGMA (MAGMA 2017) from the University of Tennessee supports further batched BLAS and LAPACK kernel operations.
In particular, batched GEMM has probably drawn the most attention, since it is ubiquitous in emerging wide range of applications, e.g., related to the convolution operations in deep learning. cuBLAS (NVIDIA 2017a) provides a batched GEMM implementation on NVIDIA GPUs with both strided and non-strided forms. MKL (Intel 2017) provides an implementation for batched GEMM on Intel CPUs, in addition to the LIBXSMM (Heinecke et al. 2016 ) library, which is based on low-level code generation. MAGMA also provides an implementation for batched GEMM on both NVIDIA GPUs and on CPUs (Abdelfattah et al. 2016b; ) for mid-range as well as very small matrix sizes, which are optimized at the register operations level. Besides batched GEMM, MAGMA implements other Level 3 BLAS batched operations, such as triangular solves (TRSM) and symmetric rank-k update (SYRK).
For advanced dense linear algebra algorithms from LAPACK, MAGMA leads the quest for developing high-performance batched matrix factorizations: (1) batched Cholesky factorizations (Dong et al. 2014b ), where three algorithms, i.e., non-blocked, blocked, and recursive blocked, are examined in contrast to the traditional hybrid CPU-GPU based factorization, (2) superseded later by batched Cholesky/LU/QR factorizations (Dong 2015; Dong et al. 2014a; Haidar et al. 2015b Haidar et al. , 2015a Haidar et al. , 2015c Haidar et al. , 2015d using batched BLAS as building blocks, and, more recently, (3) revisited batched Cholesky factorization (Kurzak et al. 2016 ) using an auto-tuning framework. All these aforementioned batched kernels operate on matrices with same sizes. In Abdelfattah et al. (2016e, 2016c Abdelfattah et al. (2016e, , 2016d , a newer implementation in MAGMA is proposed for handling batched matrix factorizations with variable sizes, which has also been of interest in the context of accelerating sparse linear algebra (SuiteSparse 2017) during the Schur complement calculations. More recently, some of the authors have proposed new batched QR and SVD kernels for very small matrix sizes with applications in the compression of hierarchical matrices (Boukaram et al. 2017) .
In this article, we focus our attention on batched Level 3 BLAS operations that involve triangular matrices (i.e., TRSM, SYRK, TRMM). We then derive several batched LAPACK algorithms that involve symmetric positive definite (SPD) matrices (Cholesky based factorization, solve, and inversion) using these aforementioned Level 3 BLAS triangular kernels. We have previously discussed and evaluated alternative formulations for the standard (non-batched) Level 3 BLAS triangular operations on large matrix sizes (Charara et al. 2016a (Charara et al. , 2016b and have shown that recursive formulations may provide superior performance by optimizing the memory access patterns. Herein, we aim at extending this strategy to very small matrix sizes and improving parallel performance by means of batched operations, with a careful treatment on memory accesses. Moreover, at the time of writing, we only consider uniform sizes, which may be critical for accelerating lowrank matrix approximations and arithmetic during the preconditioning phase of sparse iterative solvers. KBLAS 1 is an open source library providing highly optimized kernel implementations of a subset of BLAS operations on NVIDIA GPUs (Abdelfattah et al. 2012 Charara et al. 2016a Charara et al. , 2016b . All four standard precisions are supported.
THE KBLAS LIBRARY

Current Features
KBLAS currently provides support for a subset of the Level 2 and 3 BLAS kernels, i.e., the general and symmetric matrix-vector multiply (GEMV and SYMV) as well as the triangular matrix multiply and triangular solve with multiple right-hand sides (TRMM and TRSM), respectively, on single and multiple GPUs. The single GPU variants of these aforementioned Level 2 and 3 BLAS routines have been integrated into NVIDIA's cuBLAS library version 6.0 and 8.0, respectively (NVIDIA 2017b). We also provide support for matrix-matrix multiplication GEMM on multiple GPUs.
Kernel API
We try to be consistent in the KBLAS API with the standard/legacy BLAS API for easy code integration. This is especially true for single GPU routines, which operate on the default CUDA stream and thus are considered synchronous ones. Corresponding asynchronous routines are provided, which accept a CUDA stream as a parameter and are suffixed with _async. Similarly, all KBLAS routines are prefixed with kblas_, in accordance with the naming conventions in MAGMA and cuBLAS. Level 2 and 3 BLAS routines assume their parameter data already reside on the device memory, thus, categorized as GPU API. KBLAS provides corresponding CPU API routines that assume data are resident on the host memory and will implicitly handle the data transfer. These routines accept CPU data pointers and are suffixed with _cpu. Routines for multiple GPUs support are suffixed with _mgpu and accept an extra parameter specifying the number of devices to run on, with an optional parameter specifying the device ID to use.
New Features
The new batched Level 3 BLAS kernels supported in KBLAS and targeted in this article are TRSM, TRMM, and the symmetric rank-k update kernel SYRK. We further provide the following batched operations of high-level LAPACK/DLA, which inherently depend on batched TRSM, TRMM, and SYRK: the Cholesky factorization (POTRF), the positive definite triangular solve (POTRS), the solution to system of linear equations with positive definite matrix (POSV), the inverse of a real upper or lower triangular matrix (TRTRI), the product of upper or lower triangular matrix with its transpose (LAUUM), the inverse of a real symmetric positive definite matrix (POTRI), and a new LAPACK routine resulting from the merge of the Cholesky factorization and inversion of symmetric positive definite matrices (POTI).
These new batched routines follow the naming and API conventions adopted in MAGMA/ cuBLAS, and are suffixed with _batch. Besides the legacy BLAS and LAPACK parameters, batched routines have two flavors: one accepting an array of pointers to the batch of matrices assuming the input data are scattered in memory and another that assumes the data have contiguous memory allocation, thus, accepting a single memory address pointer and a stride value between consecutive input matrices.
LIMITATIONS OF TRIANGULAR DLA OPERATIONS
It is critical, for adequately designing the needed batched CUDA kernels, to study, understand, and expose the level of parallelism inherent in the triangular DLA operations. In fact, the data dependency in the algorithm of each operation dictates the degree of parallelism of such operation and, consequently, may limit or enrich the level of parallelism. In the context of our targeted triangular DLA operations, two key factors play the main role in determining the level of inherent parallelism: the triangular matrix shape and in-place (IP) nature of the operation.
The in-place nature of triangular DLA operations encounters some data read-write dependency hazards, since computing some elements need the values of other elements before the latter get updated (Write-After-Read or WAR hazard) or after they get updated (Read-After-Write or RAW hazard). Such dependencies limit the parallelism inherent in the DLA operation and may allow processing only one row or column of elements at a time. To alleviate these data-dependency impacts on parallelism, one option is to use an out-of-place (OOP) variant at the cost of extra memory allocations and data transfers and, therefore, limiting the problem size that can be processed within the limited GPU memory, but most importantly, not conforming to the legacy BLAS API.
The other key factor, i.e., the triangular or symmetric shape of involved matrices, although only half of the matrix elements are processed, engenders load imbalance among threads that compute in parallel a row or a column of entries, and they may have to diverge as they cross the diagonal entries. Since threads within a CUDA warp execute the same instructions in a lock-step fashion, thread divergence within a warp is expensive, because it forces diverging threads to idle status while other lock-step synchronized threads compute, thus wasting valuable core cycles. Abdelfattah et al. (2016d) design two variants of batched Cholesky factorization to cope with the triangular nature of the matrix: loop-inclusive and loop-exclusive. In the first, all factorization iterations are executed in one kernel to maximize chances of data reuse. In the latter, each iteration is executed in a separate kernel launch to optimize resource utilization.
Instead, our approach to remedy the effect of both factors, triangular shapes and in-place nature, is based on a holistic set of techniques to increase parallelism while enhancing memory accesses: (a) recursive formulations, with multi-dimensional recursion, which help optimize resource allocation and relieve WAR and RAW data-dependency hazards; (b) register blocking, which relaxes shared memory limitation constraint and operates at faster register memory; (c) fused kernels, which improve data reuse; and (d) nested batches, which increase concurrency in computation.
FUNDAMENTAL ALGORITHMIC TECHNIQUES
In this section, we describe the techniques to improve the performance of batched triangular BLAS and LAPACK operations for very small sizes on GPUs. Although these operations differ in their processing algorithms and their outcome, they share a common trait, that is, involving triangular matrices. We thus operate on them in a uniform approach using the following range distinction for very small matrix sizes: (1) for matrix size up to 16 we design highly optimized CUDA kernels and (2) for matrix size beyond 16 and up to 256, the CPU driver orchestrates the recursion formulation by launching these CUDA kernels and then stops at the diagonal blocks of size 16, on which we apply the CUDA kernels from (1).
Recursive Blocking
Upon processing matrices of medium to large sizes, blocking is a necessary practice, because such matrices cannot fit in the small (first-level) caches available in modern CPUs or GPUs. Blocking helps in stressing the hierarchical memory of both CPU and GPU architectures.
In a recent work (Charara et al. 2016a (Charara et al. , 2016b , we have illustrated, for large matrix sizes, how recursive formulations reduce data transfer across the GPU memory hierarchy and increase parallel performance, when applied to triangular Level 3 BLAS operations, by casting most of the computations in terms of GEMM operations.
Such blocking may be not necessary when processing small matrices that fit in cache. However, when processing a large number of small matrices, caches may quickly saturate, and one should still consider a form of blocking for this case. Therefore, we extend our previous technique for handling batched operations on small matrix sizes. Figure 1 illustrates the successive steps for the batched POTRF operation using recursive blocking. In our design, the recursive formulations resort at the base case-for processing off-diagonal blocks-to highly optimized CUDA kernels, which store and operate on data at the level of registers, as shown in Figure 1 Note that the binary recursive formulations we employ here are fundamentally different from unary recursive formulations (otherwise known as the block algorithm), where a panel with predefined width is factorized followed by an update to the right side, then a unary recursion is applied to the right side. Our binary recursive formulation starts by splitting the triangular matrix at (usually) half size and then operates recursively on each side either independently or sequentially based on the involved triangular operation.
Register Hosted Computations
Similarly to the CPU memory hierarchy, an NVIDIA GPU features memory hierarchy structured as follows, mentioned in their ascending order of access speed: (1) a device on-chip memory that resides on the card, with high bandwidth (relative to SDRAM), which is a few GBs in size; (2) a non-programmable on-chip L2-cache that is shared among all Streaming Multiprocessors (SM); (3) a local non-programmable L1/read-only texture cache dedicated to each SM, with a (usually 64KB) programmable shared memory; and (4) a 32-bit register file accessible by threads running on the SM's cores.
Note that registers are the fastest memory on which it is desirable to make direct computations. Note also that CUDA imposes a limitation on how much a thread block (TB) can consume of shared memory. In our case, since the processing of each matrix (of the batched operations) is independent of the others and no data re-use is possible across the input matrices, it would be necessary to cache each matrix into shared memory to operate on it. However, saturating the shared memory per TB prevents other TBs from loading into the same SM and, consequently, sharply decreases the occupancy of the SMs and prevents proper latency hiding. For this reason, we adopt a different strategy, which relies on caching data in registers only. Register caching as we describe below has been evaluated and promoted by Falch and Elster (2014) .
Recall that recursive formulations-explained above-convert off-diagonal block processing into batched GEMM calls, while diagonal block processing is handled by our CUDA implementations for the BLAS or LAPACK batched kernels. We design highly optimized kernels that operate on tiny diagonal blocks (up to 16 in size). We fit the data on the registers of the threads cooperating to perform each matrix operation. Two key advantages can be mentioned for fitting data in registers only. First, we avoid-in most of the kernel implementations-the need for more expensive shared memory accesses. Second, we avoid saturating shared memory and, consequently, avoid limiting the number of concurrent TBs executing on the same SM. However, since the register file is shared among the executing threads, saturating register usage may also limit the number of concurrently executing TBs in an SM. For that reason, careful treatment and pressure on registers is needed to avoid register spilling into much slower local memory (which is hosted on the device memory).
However, loading data only in registers limits the possibilities of sharing data among threads (data sharing is possible only between threads of the same warp), unless we share data through shared memory, which we are trying to avoid. For that reason, we make sure that processing each matrix is limited to threads of the same warp. Whenever data sharing is needed among threads processing the same matrix, we utilize the shuffle instruction. The shuffle instruction allows registers of the same warp to share their values in four possible ways: shuffle up, shuffle down, butterfly, and indexed shuffle (NVIDIA 2017d). It is also possible to share data among sub-groups of the warp threads by defining the shuffle width, which allows us, by specializing thread sub-groups within a warp, to process multiple matrix operations with one warp. Moreover, processing each matrix operation within one warp provides the extra advantage of removing any artificial synchronizations, since threads in a warp are already lock-step synchronized. By avoiding synchronizations, the warp scheduler can do a better job hiding the data fetching latency, which becomes the main bottleneck. All in all, by using register blocking and shuffle instruction, we avoid shared-memory limitations and latency, can share data between threads at the cost of no extra cycles, and reduce synchronizations among warps.
Nested Batching Calls
As explained above, recursion breaks the computation of triangular BLAS or LAPACK routines into a set of sub-operations involving diagonal or off-diagonal matrix blocks. In some operations, e.g., SYRK or TRTRI, computation of some or all of the diagonal or off-diagonal blocks may be independent and thus may be computed in parallel, resulting in an additional level of parallelism, referred to as nested batching calls. Indeed, while a batched sub-call operates on the same corresponding sub-block of each input matrix, a nested batching call can combine several sub-calls to batched routines operating on several sub-blocks-from the same recursion level-into one batched call with 2D thread-block (TB) grid or by issuing parallel batched calls on multiple CUDA streams. Figure 2 illustrates this concept for the batched SYRK. This is especially effective when the number of batched matrices does not provide enough workload to saturate the GPU occupancy needed to utilize all the GPU cores and to hide data fetching latency. This low-occupancy situation may be encountered when running weak scaling experiments on multiple GPUs and nested batching calls may remedy such a bottleneck by further increasing concurrency. 
Kernel Fusion
Processing thousands of matrices with very small sizes concurrently in batched mode may still fall into the category of a memory-bound operation, because arithmetic intensity is very low, especially when operating on hardware with over-provisioned Flops. It would be, therefore, of high importance to avoid data transfer, whenever possible, when designing our kernels. Indeed, particular attention should be made when two or more kernels operate successively on the same data block. As outlined above, we designed our kernels to operate on data that fits in registers. Thus, instead of launching multiple kernels, with its associated kernel launch overhead, and reading/writing the data multiple times, we fuse the multiple kernels into one code that performs all the corresponding operations in-place, as long as the data fits in registers and the successive operations occur on the same data block. As an example, this is the case when solving a system of linear equations (POTRS), which consists of two consecutive triangular solves.
HIGH-PERFORMANCE IMPLEMENTATION DETAILS
In this section, we present a detailed description of the implementations of the batched triangular operations, based on the guidelines outlined in Section 5. There are two main components for the studied batched kernels: the recursive formulations and the required techniques for highly optimized CUDA kernels operating on tiny matrix sizes, which are invoked at the recursion stopping criterion.
Pseudocode Samples
To illustrate the concepts described in Section 5, we provide sample pseudocode for some of the supported batched routines that are representative of the others. Concepts applied in these pseudocode snippets are detailed in the following sections. Source codes of all routines are available in the KBLAS open-source library (www.github.com/ecrc/kblas-gpu). In the following code snippets, input validation, error checking, and error reporting are omitted for simplicity.
Algorithm 1 describes the recursive Cholesky factorization CPU driver routine, which calls Algorithm 2 CUDA kernel to factorize tiny diagonal blocks at the recursion base case. It also calls Algorithm 3 that implements recursive TRSM, which in turn calls Algorithm 4 CUDA kernel to solve for blocks involving diagonal blocks. For brevity, Algorithm 1 and Algorithm 2 handle the lower case of Cholesky factorization. The upper case can be described accordingly. Similarly, Algorithm 3 and Algorithm 4 handle the right/lower/transpose case of TRSM. Other variants can be described accordingly.
ALGORITHM 1: RecPOTRF_Batch(uplo, n, batchCount, A_array, row_offset, col_offset, lda) Input: An array of pointers (A_array) to SPD matrices to be factorized, of size n each, with leading dimension Recursive Implementation. The recursive algorithms for triangular matrices are generally described as follows. Assuming the lower case, split the triangular matrix A into three sub-matrices: an upper-left triangular matrix A1, a lower-left rectangular matrix A2, and a lower-right triangular matrix A3, as illustrated in Table 1 . We split matrix B, if involved, at a corresponding index, into B1 and B2. This is translated as splitting the rows of B for left-sided operations and the columns of B for right-sided ones. The split is preferred to be at an index that is a power of 2, since this produces fewer clean-up trailing matrices and, thus, helps generate better performance; however, to avoid operating on thin matrices produced upon encountering sizes that are slightly bigger than powers of 2, one may split at a power of 2 that is not the closest. We proceed by applying the same recursive scheme to the left side of the recursion involving A1, applying any middle updates ALGORITHM 2: POTRF_Batch_CUDA_kernel(n, A_array, lda, row_offset, col_offset) Input: An array of pointers (A_array) to SPD matrices to be factorized, of size n each (n ≤ TX ), with leading dimension lda, and row and column offsets Output: Matrices are factorized in-place involving A2, then applying the recursive scheme to the right side of the recursion involving A3. The recursive scheme for each operation is also detailed in Table 1 . The recursion is stopped upon reaching a size that can be handled by our CUDA kernels as described in Section 6.3 and illustrated in Algorithms 1 and 3.
Multi-Dimensional Recursion. The main advantage of the recursive implementation of the batched DLA kernels is in relaxing the WAR or RAW data-dependency hazards. For the case of batched BLAS operations, the recursive scheme achieves this relaxation by converting the IP BLAS operations into a set of calls to OOP GEMMs. However, the recursive scheme of some LA-PACK operations does not involve direct conversion to GEMMs, as detailed in Table 1 , e.g., in the case of Cholesky factorization (POTRF). To remedy this shortcoming, it is necessary to use multi-dimensional recursion, that is, invoke the recursive formulations of the sub-components of the main recursive call. For example, RecPOTRF invokes itself as well as RecTRSM and RecSYRK, which, in turn, convert into batched GEMM calls, as illustrated in Algorithm 1.
CUDA Kernels
We discuss the design and implementation of the KBLAS CUDA kernels that process batched operations on matrices of tiny sizes (up to 16). Table 1 lists the CUDA kernels that are needed ALGORITHM 3: RecTRSM_Batch(side, uplo, trans, diag, m, n, batchCount, alpha, A_array, A_row_ offsetA, A_col_offset, lda, B_array, B_row_offset, B_col_offset, ldb) Input: An array of pointers (A_array) to triangular matrices, of size n each, and array of pointers (B_array) of rectangular matrices, of size m × n each, with leading dimensions lda and ldb for the corresponding batched operations. Recall our motivation to store and process such tiny matrices in registers, as explained in Section 5.2, to avoid the latency of shared memory, since such data can fit in registers only. However, we need to carefully craft the kernel's pressure on registers to avoid register spillage into much slower local memory. Note that, since the processing of each matrix in the batched operation is independent, there is, generally, enough parallelism in batched operations to involve all the cores of the underlying hardware in the computation, especially when the batch size is large enough. However, that is not enough to extract high performance from the involved operations for several reasons. First, due to the low arithmetic intensity of the targeted operations at very small matrix sizes, these operations are memory bound. Thus, proper utilization of the hierarchical GPU memory structure and bandwidth is needed. Second, since each matrix operation is independent of the other batched operations, each matrix will need to be loaded into shared memory or registers or both to be processed, which brings huge pressure on the low capacity shared memory and registers. The ALGORITHM 4: TRSM_Batch_CUDA_Kernel( m, n, batchCount, alpha, A_array, lda, A_row_offset, A_col_offset, B_array, ldb, B_row_offset, B_col_offset) Input: An array of pointers (A_array) to triangular matrices, of size n each, and array of pointers (B_array) of rectangular matrices, of size m × n each, with leading dimensions lda and ldb respectively Output: Solution of AX = B is stored in matrices B 1 const TGS ← 8 // Thread Group Size (TGS) templatized for various sizes alternative option of accessing the main memory directly from the CUDA threads is not considered due to the inherently slow global memory accesses. Such memory pressure would sharply decrease the occupancy of the GPU and, consequently, decrease the kernel's performance. Third, the triangular or symmetric shape may drive threads to an idle state, while other threads are computing. For these reasons, careful design of the CUDA thread blocks (TB) and TB grid is needed, as discussed in the next subsections.
Thread-Block Design.
Multiple options can be explored for the design of TB and the distribution of computation on multiple threads, some of which has been explored by Oreste et al. (2013) in the context of batched LU factorization. In the next few paragraphs, we explore and evaluate these options and justify our choice of TB design and TB grid layout.
Thread-level Parallelism. In this layout, one CUDA thread computes one operation. Such layout would allow arbitrary matrix sizes to be processed per thread with no idle cycles. However, processing each matrix operation will be serialized being served by one thread. Its main disadvantage is in having to fetch all data directly from main off-chip memory, since neither single thread register capacity nor shared memory capacity would suffice for caching the involved matrices, unless very few threads are involved per TB, which in turn engenders very low occupancy and, consequently, very low performance. Obviously, this approach is not the best configuration we can use.
Warp-level Parallelism. In this layout, one warp computes one operation. Since the parallelism available in all of our target kernels is limited to processing one row (or column) concurrently, due to the IP requirement, this TB layout allows one thread to compute all elements of a column (or row respectively) sequentially. This layout makes better utilization of parallel CUDA threads and may allow data of each matrix to be stored within registers and eventually may allow supporting matrices of size up to 32 in dimension per kernel. However, due to the triangular shape of the matrices, a high percentage of the consumed cycles would be spent by idle threads. Moreover, the register pressure would be very high (demanding 32 × 32 × 2 = 2,048 registers per warp for the double precision case to process a matrix of size 32 × 32), which causes register spillage to the slower local memory and sharply decreases the occupancy. Such configuration is far less than optimal for our purposes.
Thread-block-level Parallelism. One TB (including one or more warps) to process one matrix is also not a suitable choice for the same reasons above. Additionally, only a small number of TBs can run concurrently on an SM by CUDA design, which definitely does not increase occupancy, especially when the batch size is huge.
Thread-group-level Parallelism. We propose and implement this TB design that makes better utilization of the SM resources and minimizes idle thread cycles. We introduce the notion of thread groups (TG). A TG is a subset of threads of a warp that can share register values through the CUDA shuffle instruction. The shuffle instruction can share data between threads within a thread-index range. This range is enforced by CUDA to be a power of 2 (i.e., only possible values are 2, 4, 8, 16, or 32) . Consequently, we define a TG size (TGS) in a corresponding manner (i.e., 4, 8, or 16). For our purposes, a TG would store and compute one matrix operation. The amount of register storage needed is also dictated by the TGS for triangular matrices. We, thus, introduce the constant EPT (elements-per-thread), which is the size of an array statically declared by each thread to store the input/output matrices elements. The EPT for a triangular matrix has to be congruent to TGS; however, EPT for a rectangular matrix can be tuned for optimal register pressure. We also introduce the tunable parameter WPTB, representing the number of warps per TB. Consider one more constant WS representing the warp size (which is fixed by CUDA to 32 but may change with future GPU architectures). With these notions in place, we design our thread blocks as two-dimensional (2D) blocks (TBx, TBy), where vertical dimension TBx = TGS, and horizontal dimension TBy = W PT B × W S / TBx. Consequently, one warp stores and processes multiple matrices. Figure 3 illustrates a typical example case where the matrix dimension of the batched matrices is 8 or less; we use TGS = 8, EPT = 8; consequently, a warp is serving four matrices, in this particular example, (demanding 8 × 8 × 2 × 4 = 512 registers to host four 8 × 8 matrices). The choice of WPTB is empirically influenced by the register demand of the threads, which in turn affect occupancy. The options are limited and range from 1 to 4. Not that these design parameters (TGS, EPT, WPTB) are templatized and automatically selected at runtime based on input matrix size, operation variant, and kernel traits. One may also leverage the tuning framework introduced in Kurzak et al. (2016) , which allows extensive tuning through the Bench-testing Environment for Automated Software Tuning (BEAST) (BEAST 2017) auto-tuning framework. The proposed thread-group parallelism is illustrated in Algorithms 2 and 4.
TB-Grid Design.
CUDA thread blocks can be organized in one-, two-, or threedimensional grids. Given the TB design outlined previously, we use 1D grids for the general cases. The 1D grid size (GS) is affected by two values: the batch size (BS), and the number of operations served by a TB, which is equivalent to TBy. Thus, the grid size is determined by GS = BS/T By . However, there are two cases where we employ 2D grids to enable further optimizations as explained in following paragraphs.
Nested Batching Calls. We discussed in Section 5.3 the concept of nested batching calls. This can be achieved by deploying multiple CUDA streams to execute in parallel the independent batched calls within a recursion level. It can also be achieved in a simpler way by deploying 2D grid of the involved batched call, the x-dimension is used as usual across the batch size, and the y-dimension is used to batch across the parallel computations of sub-blocks from the same matrix. This is the case, for instance, for the kernels KerSYRK and KerTRTRI. Figure 2 illustrates this concept for the Ker-SYRK kernel, applied to the red diagonal blocks, which can be batched within one whole SYRK operation as well as across the batch size. The same can be said about similarly green-shaded blocks.
Enhancing Parallelism for Large Dimensions. In this article, we target matrices with very small sizes, i.e., up to 256 in dimension. However, some operations accept two inputs for dimensions, consider for example the left-sided TRSM operation, it takes as input M, the number of rows of the triangular matrix A, and NRHS, the number of right-hand sides for matrix B. To allow the NRHS to grow beyond the small size, we split the NRHS into chunks, and deploy a 2D grid of TBs, where the x-dimension works across the batch size as usual, while the y-dimension launches TBs across the NRHS chunks. Such arrangement of a 2D grid is needed in KerTRSM, KerTRMM, KerPOTRS, and KerPOSV. We illustrate this configuration in Algorithms 3 (lines 3-7) and 4 (lines 3-12).
Additional Optimization Techniques.
Shared Memory Buffering. In Abdelfattah et al. (2016f) , we introduced the concept of double buffering at the register level. Double buffering is used to hide the latency of fetching data from main GPU memory by pipelining its instructions with other computation instructions. In this work, we employ the same concept by buffering at the shared memory level, since register space is quite tight. We use shared memory buffering for two purposes. The first type of usage is for early fetching of data that are scheduled to be processed next. By the time the current block is processed, the next block is being fetched into the shared memory block in parallel, thanks to the possible pipelining of instructions made possible by the advanced nvcc compiler. When ready, the data at shared memory buffer would be copied to the register buffer for processing, and then the next block is fetched again. Shared memory buffering is needed in the kernels KerTRSM, KerTRMM, 15:16 A. Charara et al. KerPOTRS, and KerPOSV. More important, we use shared memory buffering to avoid non-coalesced memory reads. When fetching data into register blocks, non-coalesced reads may be encountered, especially with transposed cases. To avoid these slow reads, we fetch the data into shared memory with coalesced reads and then transpose the data into register blocks for processing. Such usage of shared-memory buffering is illustrated in Algorithm 4 (lines 29-32).
Kernel Fusion. In Section 5.4, we explain and justify the need and advantage for fusing batched kernels. We note here the operations where we apply batched kernel fusion within KBLAS implementation. Table 2 summarizes the fused batched kernels and the corresponding matrix sizes.
Loop Unrolling. Loop unrolling is an effective optimization technique that is widely used. It enhances instruction-level parallelism (ILP) with the aid of the compiler. Loop unrolling can be automatically applied by the compiler based on the level of optimization flags or by explicitly instructing the compiler to apply it with the special pragmas. Using the nvcc compiler from NVIDIA, loop unrolling can be applied only when the number of iterations of the loop is statically defined at compile time. For this reason, by parameterizing our kernel implementations with TGS, all the internal loops that use TGS may be unrolled.
PERFORMANCE RESULTS AND ANALYSIS
This section highlights the impacts of the optimization techniques on the batched DLA operations for small matrix sizes and compares the performance against existing CPU and GPU implementations (whenever available) on various hardware configurations.
Environment Settings
Experiments reported below have been conducted on single and multiple NVIDIA K40 GPUs across all subsequent GPU performance plots (unless noted otherwise). Since all of our implementations are equivalent in Flop count to the native versions, we report performance results in Flops per second, which are obtained by dividing the theoretical Flop count of each algorithm by the execution time over a range of input sizes. We use CUDA v7.5, Intel compilers v16, and MAGMA v2.2.0. Performance timing does not include data transfer to/from GPU, since data are assumed to reside on the GPU memory, given the low arithmetic intensity of the CUDA kernels. All results are reported with IEEE compliant compilation (without fast-math compiler optimization). Although single and double precision arithmetics (SP and DP) are only shown in the subsequent performance graphs, KBLAS is distributed with a support to all precisions. The considered batch sizes are enough to saturate the device, and we generally use a batch size in SP twice larger than DP to maintain the same amount of data for both studied precisions. 
Performance Gain Breakdown
To properly assess the fundamental techniques introduced in Section 5, we examine the incremental performance gain of each optimization technique.
Recursive Formulations. The recursive formulations play a major role in accelerating triangular operations by converting them into mostly GEMMs of (nearly) square sizes. Figure 4 shows a breakdown of the execution time of each kernel in all the triangular operations, expressed as a percentage of the total time for execution. The figure shows that time spent in GEMM operation, due to recursive formulations, is dominant with respect to the other kernels. For instance, thanks to multi-dimensional recursion, RecPOTRF is decomposed into a set of batched GEMM calls, which, in turn, consumes 77% of the execution time. The performance impact over the non-batched version is demonstrated in Figure 5 for the case of batched STRSM and DTRSM. The curves cuBLAS-xTRSMRec show the effect of applying recursion over the corresponding cuBLAS TRSM, where we call cuBLAS batched GEMM kernel for off-diagonal blocks and cuBLAS batched TRSM kernel for the recursion base case. Since recursion in batched RecTRSM begins at size 32 and beyond, it does not effect at lower sizes. Indeed, the effect of recursion magnifies with the increase in matrix sizes as larger GEMMs are generated, reaching beyond 2.5× speedup at the higher end of the targeted matrix sizes. Similarly, the curves MAGMA-xTRSM-Rec in Figure 6 show the effect of applying recursion over the corresponding MAGMA POTRF, where we call KBLAS batched TRSM and SYRK for off-diagonal blocks and MAGMA batched POTRF kernel for the recursion base case.
Register Hosted Computations. The CUDA kernels-which operate at the register level-are most effective at the tiny sizes (below the recursion base case). Their impact on performance gradually decreases with the increase in matrix size becuase other kernel calls (mainly GEMM) become Figure 5 illustrate the additional gain in performance achieved by replacing the corresponding cuBLAS batched TRSM kernel with a register-hosted KBLAS batched TRSM kernel for the recursion base case, on top of that achieved by recursive formulations. Such gains range from 1.6× and 1.4×, at the upper spectrum of matrix sizes, up to 3.7× and 3.3×, at the lower spectrum, for SP and DP, respectively. The curves KBLAS-xPOTRF in Figure 6 also highlight the additional gain in performance achieved by replacing MAGMA batched POTRF kernel with a register-hosted KBLAS batched POTRF kernel for the recursion base case. Such gains range from 1.1×, at the upper spectrum, up to 2.8× and 4.3×, at lower spectrum, for SP and DP, respectively.
dominant. The curves KBLAS-xTRSM in
Kernel Fusion. Figure 7 shows a sample of performance gain obtained by fusing the two kernels of POTRS running on a K40 GPU with 10, 240 batch size. Note that the gain is maximum (reaching close to 1.5×) when the execution involves the fused CUDA kernels only (for sizes up to the base case); it decreases when recursion takes over by getting invoked on top of the fused kernels, involving GEMM calls for off-diagonal blocks.
Nested Batching Calls. Figure 8 shows the effect of nested batching calls on the batched SYRK kernel. The speedups gained for these sample runs range from 11% and 7% up to 18% and 14%, for SP and DP, respectively. Recall that nested batching calls combine several partial batch calls into one call-when possible-to improve occupancy and minimize kernel launch overhead. This is most effective when the batch size is small, where the workload would not saturate the GPU. Figure 8(a) demonstrates the speedup gained with nested batching calls for a matrix of size 64 while varying the batch size. The plot shows that, indeed, the gain is considerable when the batch size is small. Such gain gradually decreases with the increase in batch size as expected, since each partial batch call would bring enough workload to the GPU device. Figure 8(b) shows the speedup gain as the matrix size changes on a single K40 GPU. Note that the speedup decreases with the increase of matrix size as larger GEMM calls dominate the execution time. However, this technique is still effective in a weak scaling setup. Figure 8(c) shows the speedups gained with nested batching calls on 8 GPUs with batch size of 8K. The distributed load on each GPU is rather small where the nested batching becomes effective, thus exposing further parallelism and resulting in better GPU utilization. The resulting boost in performance reaches up to 22% and 16%, for SP and DP, respectively.
From now on, we report and evaluate the performance of our KBLAS implementations assuming all previously mentioned techniques have been activated.
Performance Comparisons of Batched BLAS Kernels
In this section, we compare our results against MAGMA, cuBLAS, and Kurzak et al. (2016) , whenever the relevant operations are available within such libraries. We report cuBLAS batched GEMM (shown as dashed curves) as a sustained upper-bound for the batched BLAS and DLA operations, as a more realistic bound than the theoretical peak performance of the hardware, given the low arithmetic intensity of our memory-bound batched kernels. These upper-bound curves are not always appropriate for small matrix sizes, due to the additional memory transfers required for the matrix-matrix multiply kernel, but they are still considered valid reference bounds. Furthermore, we do not check on the definiteness property of the matrices in KBLAS (similarly for the implementation of Kurzak et al. (2016) , and we have disabled it in MAGMA for a fair comparison. The overhead for such sanity check is rather minimal, and we intend to have it in KBLAS as well eventually. Figures 9, 10 , and 11 show the performance comparisons of KBLAS batched TRSM, TRMM, and SYRK with both SP and DP against MAGMA and cuBLAS (when available) on single NVIDIA K40 GPU. The performance gain for batched TRSM against MAGMA (Figures 9(a) and (b)) ranges from 1.3× and 1.4× (for relatively small sizes) up to 47× and 43× (for very small sizes) in SP and DP, respectively. Note that our KBLAS RecTRSM implementation is in-place, while MAGMA's implementation is out-of-place (OOP), thus requiring extra memory allocations for the output matrix and a workspace. The speedup ranges from 3.7× and 2.2× to 7.8× and 4.4×, respectively, relative to the in-place cuBLAS implementation.
Single GPU.
However, the gain in performance for batched TRMM (Figures 10(a) and (b)) is high for tiny sizes that are handled directly by KBLAS CUDA kernels, reaching up to 22×. However, since TRMM is basically implemented as a GEMM in MAGMA, speedups for matrix sizes beyond the base case (when recursion is invoked) are minimal and reach up to 1.68× and 1.2× for SP and DP, respectively. Similar observations can be made about the batched SYRK operation, which is also implemented as a batched GEMM in MAGMA. Consequently, speedups for sizes below the base case (32) are high (Figures 11(a) and (b) ), reaching up to 6.3× and 3.7× for SP and DP, respectively, and relatively lower for sizes beyond the base case, reaching up to 1.6× and 1.3× for SP and DP, respectively.
Multiple GPUs.
We benchmark and report performance of our implementations on multiple NVIDIA K40 GPUs. Data are assumed to reside on the devices main memories; consequently, data transfer time is not accounted for. We distribute the data and corresponding operations across devices equally. Figure 12 shows the performance of KBLAS batched DSYRK, DTRSM, and DTRMM on multiple NVIDIA K40 GPUs, for DP arithmetic. With equivalent work-load on the GPU devices, the plots demonstrate a decent strong scaling in performance on 1 to 8 GPU devices. The performance of KBLAS batched Cholesky factorization (POTRF)-shown in Figures 13(a) and (b)-achieves up to 2.6× and 2.9× speedups against MAGMA for very small sizes in SP and DP, respectively, and meets MAGMA's performance at sizes beyond 64. Figure 13(a) shows also the performance of batched POTRF in SP as implemented and reported by Kurzak et al. (2016) . Kurzak's implementation employed an exhaustive tuning approach for each matrix size using the BEAST framework. Note that we extracted their POTRF performance from their corresponding publication (which reports results on a batch of size 10,000 with SP only running on similar hardware), because the authors do not provide a software release. The performance gain against Kurzak's implementation is very close to that against MAGMA.
The positive definite triangular solve (POTRS) is effectively two consecutive triangular solves (TRSMs); consequently, it inherits the massive speedup brought by the RecTRSM kernel. The performance gain, as shown in Figures 14(a) and (b) , ranges from 1.4× and 1.3×, for moderately small matrix sizes, up to 87× and 68×, for very small matrix sizes in SP and DP, respectively. Figure 15 shows performance comparisons of the solution to a system of linear equations with positive definite matrix (POSV), which decomposes into a Cholesky factorization (POTRF) and two consecutive triangular solves (TRSMs). Therefore, it inherits the corresponding performance gains. The speedups of KBLAS batched POSV against MAGMA ranges from 1.2× and 1.3×, for relatively small matrices, up to 56× and 49× for very small matrices in SP and DP, respectively. Last, Figure 16 draws the performance plots of other batched triangular DLA operations in KBLAS (MAGMA does not provide support for these DLA operations). These additional DLA kernels are critical in the context of preconditioners for finite element discretization of elliptic partial differential equations using balancing domain decomposition by constraints (BDDC) algorithm (Dohrmann 
2003; Zampini 2016
). Thanks to the new technique optimizations described in Section 5, KBLAS batched DLA kernels run close to the sustained batched cuBLAS GEMM. Figure 17 shows the performance of KBLAS batched DPOTRF, DPOTRS, DPOSV, DLAUUM, DTRTRI, DPOTRI, and DPOTI on multiple NVIDIA K40 GPUs. Similarly to the setup described in Section 7.3.2, we distribute the load on devices equally. Again, the plots show a decent strong scaling on 1 to 8 GPU devices. Figure 18 demonstrates the performance speedup brought by KBLAS against MAGMA and cuBLAS (whenever available) across various generations of NVIDIA GPUs. Note that no extensive tuning was performed for any of the setups on various GPU architectures. The performance gain is significant for matrices with very small sizes and competitive for larger sizes, across all architectures and batched operations, and therefore emphasizes that our optimization techniques are portable. It is noteworthy that the newly released Pascal P100 architecture brings some major changes in its core technical specifications, e.g., register file size, cores per SM, shared memory capacity, and so on. Minimal tuning on the register hosted computation technique of the KBLAS code may be necessary to further increase the performance gain for very small matrix sizes.
Multiple GPUs.
Portability across GPU Architectures
Performance Profiling
To further justify and assess the performance gains recorded in the previous section, we profile in detail our implementations in regards to data transfer, arithmetic intensity, and bandwidth saturation. We compare the profiling results to those on MAGMA. Flop count and data transfer sizes are measured using NVIDIA profiler (nvprof). Figure 19 shows the ratio of MAGMA performed Flop count and memory transactions to that performed by KBLAS for the equivalent operations and matrix and batch sizes at all memory levels: L1 (including shared memory), L2, and global memory. The plots show that KBLAS is consistently performing many fewer Flops than MAGMA, since MAGMA inverts the diagonal blocks instead for the batched TRSM, POSV, and POTRS kernels and operates on the full diagonal symmetric blocks for the batched SYRK kernel. In addition, KBLAS performs much less data transfer than MAGMA at the targeted matrix sizes in various triangular operations, thanks to the optimization techniques described in Section 5. The plots show also the recorded speedups by KBLAS against MAGMA, which very much align with the ratios of data transfer rates.
Figure 20(a) shows the roofline performance model (Ofenbeck et al. 2014 ) of various KBLAS batched operations based on measured Flop count and data transfer sizes, with a batch size of 10,240 running on NVIDIA K40 GPU on square matrices of sizes 128. The figure shows that performance of KBLAS implementations is very close to the sustained performance of the underlying hardware. Figure 20(b) shows the ratio of achieved bandwidth in GB/s of KBLAS batched operations using various matrix sizes to the sustained bandwidth of the used GPU device. Although floating point performance is way below the theoretical peak of the device, achieved bandwidth ranges from 60% to 80% of the sustained device bandwidth.
Discussions on Non-uniform Matrix Sizes
Performing batched DLA operations on non-uniform matrix sizes may be challenging due to potential load imbalance issues and low hardware occupancy. To tackle these problems, our approach is to leverage the batched kernels introduced herein and use them as building blocks. In fact, their recursive implementations may facilitate this extension, by casting most of the computations in terms of GEMM CUDA calls, which support non-uniform matrix sizes (Abdelfattah et al. 2016b ).
So the challenge is then left upon the implementations of our own CUDA kernels for the diagonal blocks, which currently support only uniform matrix sizes. It turns out that this challenge may be a unique opportunity. Indeed, our recursive implementation would translate the staging of high-level non-uniform batched recursion calls into successive uniform batched recursive calls. This would require at each recursion stage to recollect and redistribute the remaining updates across the pool of transient matrices, until all matrices would reach their final outputs. Therefore, this fine-grained staging technique would permit to maintain all TB busy throughout the computations by ensuring a proper load balancing. Memory padding may be necessary within each stage only for our diagonal block CUDA kernels to handle matrix sizes less than the maximum maxN these kernels can accept. For example, current implementations that accept matrices of size 8 would be extended to handle matrices of any sizes less than 8 using memory padding. Since each thread group (TG) within a CUDA WARP processes one matrix, such TG would be able to process a matrix of size less than maxN by setting EPT = maxN . Early termination mechanism (Abdelfattah et al. 2016b ) may then take over at each recursion stage, in case the individual matrix has been totally processed, while avoiding extra flops. The overhead of the memory footprint is negligible, since the padding size is reevaluated at every stage and limited only to the diagonal CUDA kernels.
Such a natural extension of the techniques we described in this article may bring similar speedups to non-uniform matrix sizes.
CONCLUSION AND FUTURE WORK
This article presents a new set of high-performance batched BLAS and LAPACK/DLA operations on GPUs with support for very small matrix sizes. Thanks to recursive algorithmic formulations and hardware/software-related optimizations (i.e., register pressure, kernel fusions, and nested batching calls), these new GPU batched BLAS and LAPACK kernels outperform existing GPU and CPU implementations. These new kernels represent critical building blocks for designing efficient sparse direct solvers on hardware accelerators (Amestoy et al. 2011; Hénon et al. 2002) as well as in the context of machine-learning applications with low-rank matrix approximations (Akbudak et al. 2017) .
There are many tuning parameters for our CUDA recursive kernels (i.e., WPTB, TGS, EPT, etc.), which may analytically be derived from hardware specifications, such as register file sizes, maximum thread block per symmetric multiprocessor, shared-memory size, and so on. These tuning assumptions may be used as initial ranges to be empirically explored by auto-tuning frameworks such as BEAST (BEAST 2017), which would ease the portability for future GPU hardware generations.
