A batched BLAS design based on device function and big-tile setting was proposed on the GPU.
Introduction
The emergence of multicore and heterogeneous architectures requires many linear algebra algorithms to be redesigned to take advantage of accelerators, such as GPUs. A particularly challenging class of problems, arising in numerous applications, involves the use of linear algebra operations on many small-sized matrices. Their number can be thousands, even millions. For example, billions of 8x8 and 32x32 eigenvalue problems need to be solved in magnetic resonance imaging. Also, thousands of matrix-matrix (GEMM) and matrix-vector products (GEMV) are computed in hydrodynamic simulations with Finite Element Method [1] . Here the size of matrices increases with the order of the numerical methods, and can range from ten to a few hundred. GEMM is at the heart of deep neural network (DNN) computations, where rather than treating convolution as one large GEMM problem, it is much more efficient to view it as many small GEMMs [2] . Thus, Batched BLAS, and Batched GEMM in particular, are central part of performing deep learning, and therefore can be used to accelerate frameworks like PaddlePaddle [3], Theano [4] , TensorFlow [5] , and Torch [6] . In an astrophysics ODE solver [7] , multiple zones are simulated, and each zone corresponds to a small linear system solve based on an LU factorization [7] . If the matrix is symmetric and definite, the problem is reduced to a batched Cholesky factorization [8, 9] . In [10] , there are demands to compute one million 25x8 and n*20 SVD problems, where n ranges from A c c e p t e d M a n u s c r i p t 32 to 4096.
The acceleration of the aforementioned many small-sized linear algebra problems has become extremely challenging for current many-core architectures, and in particular GPUs. To address the wide range of application needs and the parallelization challenges related to them, we developed a number of new batched computing techniques and designed batched BLAS (Basic Linear Algebra Subroutines) routines, and in particular the Level-2 BLAS GEMV and the Level-3 BLAS GEMM routines, to solve them. To describe these developments, we start with other related work and summary of our contributions (in Section 2), followed by algorithmic background (in Section 3) and the Householder Bi-diagonalization algorithm (in Section 4), performance analysis and a roofline model that guides our design and optimizations (in Section 5), the main batched BLAS design, optimization techniques and implementation for GPUs (in Section 6), our auto-tuning strategy (in Section 7), performance on NVIDIA GPUs (in Section 8) and AMD GPUs (Section 9). Finally, conclusions and future work directions are given in Section 10.
Related Work and Contributions
The acceleration of many small-sized linear algebra problems has become extremely challenging for current many-core architectures, and in particular GPUs. Standard interfaces have been proposed for some of these problems, called batched problems, so that they get targeted for optimization and used in a standard way in application directly from highly-optimized, standard numerical libraries, like (batched) BLAS and LAPACK [11] . Indeed, vendors like NVIDIA and Intel started to provide certain batched functionalities in their cuBLAS and MKL libraries, respectively. MAGMA [12, 13] , an open source library, provides the most extended set of batched BLAS and LAPACK functionalities to date. In particular, efficient batched one-sided factorizations (LU, QR, and Cholesky) were developed in [14] [15] [16] [17] , and are now released through MAGMA. These factorizations are compute-bound and rich in Level-3 BLAS operations. Therefore, the main effort in developing them lied in algorithmically enhancing the percentage of Level-3 BLAS operations, using techniques While most of the developments have been for one-sided factorizations and solvers, many important applications -from big data analytics to information retrieval, lowrank approximations for solvers and preconditioners -require two-sided factorizations, and most notably the SVD factorization. To develop them, in contrast to the computebound one-sided factorizations, one must consider the acceleration of the memorybound two-sided Householder bi-diagonalizations (GEBRD). These routines are the most time-consuming part that is needed for the Singular Value Decompositions (SVD) in many applications. Instead of BLAS-3 GEMM, the Householder bi-diagonalization problem is rich in memory-bound Level-2 BLAS GEMV operations. Thus, the goal is to develop efficient batched Level-2 BLAS that minimizes memory transactions and maximizes bandwidth. To accomplish this, we propose a device functions-based methodology and big-tile setting techniques in our batched BLAS designs, in order to facilitate data reuse. The different optimization techniques, as well as the various instances of GEMV to accelerate, result in many software versions that must be tuned, for which we adopt an auto-tuning strategy to automatically derive the optimized instances of the routines.
Throughout this paper, our batched routines are named as MAGMA batched routines, and released through the MAGMA library.
Our main contributions are: 1) Design batched BLAS device functions and kernels, as well as efficient implementations and optimization techniques; 2) Design twosided bi-diagonalization for batched execution based on the batched BLAS approach;
and 3) Port and tune the developments to a number of high-end GPUs, including both NVIDIA and AMD GPUs.
Our batched BLAS can run on Nvidia GPU with a CUDA code version that we extend from [18] , but also on AMD GPUs through a HIP code version that we developed.
HIP programming on AMD GPU is rather new [19] . To our best knowledge, no similar work has been presented before.
A c c e p t e d M a n u s c r i p t
Background
The SVD problem is to find orthogonal matrices U and V , and a diagonal matrix Σ with nonnegative elements, such that A = UΣV T , where A is an m × n matrix. The diagonal elements of Σ are singular values of A, the columns of U are called left singular vectors of A, and the columns of V are called right singular vectors of A. Such problem is solved by a three-phase process:
1. Reduction phase: orthogonal matrices Q and P are applied on both the left and the right side of A to reduce it to a bi-diagonal matrix -hence these are called "two-sided factorizations." 2. Solution phase: a singular value solver further computes the singular values Σ and the left and right vectors U and V T of the bi-diagonal matrix;
3. Back transformation phase: if required, the left and the right singular vectors of A are computed by multiplying U and V T by the orthogonal matrices Q and P used in the reduction phase.
It is well known that the first phase is the most time consuming portion of the SVD problem [20] . Benchmarks show that it consists of more than 70% or 90% of the total time when all singular vectors or only singular values are computed on modern architectures, respectively. For that, we focus in this paper on the reduction phase for a batch of small problems, and study its limitations.
Householder Bi-diagonalization
The bi-diagonalization factorizes A = UBV T , where U and V are orthogonal matrices, and B is bi-diagonal with non-zeros only on the diagonal and upper superdiagonal. This is done by the classic stable Golub-Kahan method that applies a sequence of Householder transformations [21] . Algorithmically, this corresponds to a sequence 
This algorithm is sequential and rich in Level-2 BLAS GEMV routine calls that are applied in every step for updating the rest of the matrix. The blocked two-phase algo- 
Performance Bound Analysis and Roofline Model
In order to evaluate the performance behavior of the reduction to bi-diagonal and to Without loss of generality, A is assumed to be of size n × n. A(i : j, m : k) is the submatrix of A consisting of i-th through j-th row and m-th through k-th column with 0-based indexing. for i ∈ {1, 2, 3, . . . , n/n b } do
Panel Factorize using LABRD to reduce A ix and A iy to bi-diagonal form; returns matrices X, Y to update trailing matrix C i in the next phase; U, V are stored in the factorized part of A.
Trailing Matrix Update
end for cies and produces artificial synchronization points between the panel factorization and the trailing submatrix update. This makes it impossible to overlap the panel and the trailing submatrix update. Therefore, we can model the performance of our algorithm by the performances of its basic kernels (as they have to be executed in order).
The algorithm proceeds by steps of size n b . We give the detailed panel and update costs per step:
• The panel is of size n b columns. The factorization of every column is primarily dominated by two matrix-vector products with the trailing matrix. Thus, the cost of a panel is 4 n b l 2 + Θ(n), where l is the size of the trailing matrix at step i.
For simplicity, we omit Θ(n) and roundup the cost of the panel by the cost of the matrix-vector product;
• The update of the trailing matrix consists of applying the Householder reflectors generated during the panel factorization to the trailing matrix from both the left and the right side using Level-3 BLAS routines: For all steps (n/n b ), the trailing matrix size varies from n to n b by steps of size n b , where l varies from n to n b and k varies from (n − n b ) to 2 n b . Thus, the total cost for the n/n b steps is:
According to the above equation, we derive below the maximum performance P max that can be reached by the bi-diagonal reduction algorithm as a function of the performances P gemm and P gemv for the gemm and gemv kernels, respectively. In particular, for large matrix sizes n:
The performance of the Level-2 BLAS routines such as the batched matrix-vector multiplication (gemv) is memory bound and very low compared to the Level-3 BLAS dgemm. For example, on a K40c GPU the performance of batched dgemv is about 40 Gflop/s as shown in Figure 4a , while for batched dgemm it is about 323 Gflop/s as illustrated in Figure 2 . Thus, one can expect from Equation (1), that the performance of the reduction algorithm is bound by the performance of the Level-2 BLAS operations.
This explains the well known low performance behavior observed for the algorithm.
Batched BLAS Design and Implementation for GPUs
In a batched problem that is based on batched BLAS, many small dense matrices must be factorized simultaneously, meaning that all the matrices will be processed simultaneously by the same kernel.
Two-level Parallelism and Device-kernel Mode
Our batched BLAS kernels do not make any assumption about the layout of the matrices in memory, e.g., the matrices are not necessarily stored continuously. The M a n u s c r i p t starting address of every matrix is stored in an array of pointers, and the batched kernel takes the array of pointers as input. Note that to use the array of pointers interface, extra memory must be allocated as workspace, compared to the assumption of consecutive matrix storage. Inside the kernel, each matrix is assigned to a unique batch ID and processed by one device function. Device functions are low-level and callable only by CUDA or HIP kernels.
The device function only sees a matrix by the batched ID and thus still maintains the same interface as the standard BLAS. Moreover, we use multiple GPU threads per matrix factorization, which is different from [22] , where only one thread is used.
Thus, our batched BLAS is characterized by two levels of parallelism. The first level is a task-level parallelism among the independent matrices that are simultaneously processed. The second is fine-grained data parallelism within the computation of each matrix and its goal is to exploit the SIMT architecture of the GPU through device functions. This strategy yields higher parallelism in our algorithms (occupancy) that results in better use of the GPU, and therefore higher performance. the Householder vector. In order to reuse data in shared memory, we propose a big-tile setting which will be described in the next section. Third, if the underlying computation is the same, only one copy of device function is maintained for different kernels. Thus, using a device functions-based methodology to designs for our internal APIs and to build higher-level algorithms on top of these APIs, is a central characteristic of our developments towards taking advantage of the aforementioned benefits. A c c e p t e d M a n u s c r i p t Therefore, to avoid many kernel launches, in our design, each matrix is assigned to a thread block, and the synchronization is accomplished by barriers inside the thread block. We call this thread block setting big-tile setting. The naming is from this observation: if the tile (i.e., thread block size) is big enough that a whole matrix (from the batch) is inside the tile, that matrix computation reduces to the point that one thread block accesses the whole matrix.
However, compared to the big-tile setting, the classic setting with multiple thread blocks processing one matrix has a higher degree of parallelism as different parts of the matrix are processed simultaneously, especially for large square matrices. Thus, overall, there is a trade-off. Big-tile setting allows data to be reused through shared memory, but suffers a lower degree of parallelism. The classic setting has a higher degree of parallelism, but may lose the data reuse benefits. The optimal setting depends on many factors, including the algorithm type and matrix size, and is often selected by auto-tuning (as in Section 7.2). Our experience shows that for the panel factorization, the big-tile setting has advantage. While for the trailing matrix update with GEMM computation, the classic setting is preferred.
Batched Bi-diagonalization Implementations on GPUs
One approach to the batched problems is to consider that the entire matrix is small enough to fit into shared memory. For example, the current size of the shared memory is 48 KB per streaming multiprocessor (SMX, or computing unit on AMD GPU) for the high-end NVIDIA K40c GPUs, which is a low limit for the amount of batched problems data that can fit at once. However, completely saturating the shared memory per SMX can decrease the performance of memory-bound routines since only one thread-block will be mapped to that SMX at a time. Due to a limited parallelism in the factorization of a small panel, the number of threads used in the thread block will be limited, resulting in low occupancy, and subsequently poor core utilization. The advantages of multiple blocks residing on the same SMX is that the scheduler can swap out a thread block waiting for data from memory and push in the next block that is ready to execute [23] . We found that using a small amount of shared memory per kernel (less than 10KB) not only provides an acceptable data reuse, but also allows many thread- where for sizes larger than 32, all 32 threads of a warp progress to do a tree reduction to reduce to 32 elements. The last 32 elements are reduced to one by only one thread.
The Householder reflectors are frequently accessed and are loaded in shared memory.
A set of GEMV routines are called to update the rest of the panel and matrices X and Y .
Since there are n b steps, these routines are called n b times; thus, one can expect that the performance depends on the performances of Level-2 and Level-1 BLAS operations.
Hence, it is a slow, memory-bound routine.
Trailing matrix updates with GEMM: The update is achieved by two GEMMs with the matrices X and Y returned from the panel factorization. The first one is a GEMM with a non-transpose matrix and a transpose matrix (A = A −V * Y ), followed A c c e p t e d M a n u s c r i p t by another GEMM with a non-transpose matrix and a non-transpose matrix (A = A − X * U ) . The update is directly applied on trailing matrix A. However, for very small matrices it might be still difficult to extract performance from Level-3 BLAS kernels.
Auto-tuning
The efforts of maximizing the performance of BLAS, especially GEMM, generally fall into two directions: writing assembly code and source level code tuning. The vendor libraries (e.g., Intel MKL, AMD ACML and hipBLAS, NVIDIA CUBLAS) supply their own routines on their hardware. To achieve performance, the GEMM routine is implemented in assembly code, like the CUBLAS GEMM on Kepler GPUs.
The assembly code usually delivers high performance. A disadvantage is that it is highly architecture-specific. The vendors maintain the performance portability across different generations of their architectures [24] . Another direction is to explore the source level code auto-tuning to achieve optimal performance (within a preset kernel design space). Different from assembly code, source code auto-tuning relies on the compilers to allocate registers and schedule instructions. The advantage is that source code is architecturally independent and is easy to maintain. Our effort focuses on source code auto-tuning.
Batched Level-3 BLAS GEMM
The performance of linear algebra routines highly relies on the Level-3 BLAS GEMM. Our batched GEMM is modified from the standard MAGMA GEMM [25] .
The template parameters of our batched GEMM include the number of threads, the size of shared memory, and the data tile size. The product of the sizes for these parameters produces a large search space, but it can be powerfully pruned by constraints.
The derived constraints of the search space include correctness, as well as hardware constraints and soft constraints. Hardware constraints stem from the realities of the accelerator architecture, like registers and shared memory size. Based on these metrics, invalid kernels violating the hardware requirement (like exceeding 48KB shared memory) will be discarded. The constraints may be soft in terms of performance. We 
Different Batched Level 2 BLAS GEMV instances Tuning
In matrix-vector multiplication using a non-transpose matrix (GEMVN), a reduction is performed per row. Each thread is assigned to a row and a warp of threads is assigned to a column. Each thread iterates row-wise in a loop and naturally owns the reduction result. Since matrices are stored in column-major format, the data access in GEMVN by the warp is in a coalescing manner.
However, in GEMV using a transpose matrix (GEMVT), the reduction must be performed on each column. Assigning a thread to a column will make the reduction easy, but will lead to memory access in a striding way. To overcome the non-coalescing problem in GEMVT, a two-dimension thread block configuration is adopted. Threads 
Performance on Nvidia GPU
We conducted our experiments on a multicore system with two 8-cores socket Intel Xeon E5-2670 (Sandy Bridge) processors with each running at 2.6 GHz. Each socket has a shared 20 MB L3 cache, and each core has a private 256 KB L2 and a 64 KB L1 cache. The system is equipped with 64 GB of memory and the theoretical peak in double precision is 20.8 Gflop/s per core, i.e., 332.8 Glop/s in total for the two sockets.
It is also equipped with an NVIDIA K40c GPU with 11.6 GB GDDR memory per card running at 825 MHz. The theoretical peak in double precision is 1, 430 Gflop/s. The GPU is connected to the CPU via PCIe I/O hubs with 6 GB/s bandwidth.
In our testing, we assume the data already resided in the processor's memory. Unless explicitly noted, the memory transfer time between processors is not considered.
We believe this is a reasonable assumption since the matrices are usually generated and processed on the same processor. For example, in the high order FEMs, each zone assembles one matrix on the GPU. The conjugation is performed immediately, followed by a batched GEMM. All the data is generated and computed on the GPU [1].
Performance study and Optimization of the bidiagonalization
Since the performance of the batched GEMV on K40c is around 40 Gflop/s (as shown in Figure 4a ), the GEBRD roofline bound is 80 Gflop/s according to equation (1). A sharper bound, using equation (1) and that the Batched GEMM performance is 323 GFlop/s, is 2 * 323 * 40 323+40 ≈ 71 GFlop/s. Figure 5 demonstrates the performance improvement progress of our implementation. The non-blocked version purely composed of Level 2 BLAS operations does not scale any more after size 256. The first non-optimized blocked version v1 follows the LAPACK's two-phase implementation as depicted in Algorithm 1 in which the trailing matrix is updated with Level 3 BLAS operations. Additional memory allocation overhead has to be introduced in order to use the array of pointers interfaces in the A c c e p t e d M a n u s c r i p t blocked algorithm. Below size 224, the performance of version v1 is even slower than the non-blocked due to the overhead. Beyond 224, it starts to grow steadily because of GEMM performance.
The main issue of the blocked version v1 is that the GEMVs are not optimized for instances required by the GEBRD. By tuning these GEMVs, as described in Section 7.2, the performance is immediately doubled in version v2. These GEMV routines are called in the form of device functions in the panel factorization kernel. The column/row vector of Householder reflectors and the to-be-updated column in matrices X and Y are repeatedly accessed at each step. We load them into fast on-chip shared memory. In order to reuse and synchronize data in shared memory, one matrix can not span multiple thread blocks. Therefore, we adopt the big-tile setting for the GEMV device functions in v2. As discussed in Section 6.1, there is a trade-off between data reuse (with big-tile setting) and the degree of parallelism (with classic setting). We found there is a switch over at size 128 for the two settings. We adopt classic setting beyond size 128 and big-tile for size less than 128 for square instances. The big-tile setting is still adopted for other wide/tall instances because the data caching proves to be more important. By this switch-over, the performance of version 3 boosts to 50 Gflop/s from 40 Gflops in version 2 at size 512.
By profiling the GEMV time in GEBRD step by step, we find it does not match the optimal performance obtained in our auto-tuning. In Figure 4a , the blue curves depicts the performance of GEMV transpose of double precision with every matrix being aligned in memory. However, when the algorithm iterates the sub-matrix as in GEBRD factorization, the starting address may not be aligned (green curve). The performance curve fluctuates because when the starting address of the sub-matrix is aligned in memory, the peak performance is reached; otherwise, it drops drastically.
The fluctuation is more serious for bigger matrices since most threads are mis-aligned as more threads are used in large size. read, it is coalescing that the 128-byte segment can be fetched by only one transaction.
Overall the number of memory transactions is reduced as shown in Figure 4b . Since the number of memory transactions decreases, the bandwidth is improved accordingly.
By padding elements in the multiplied vector as zeros, extra results are computed but finally discarded in the writing stage. Figure 4a shows that our padding technique enables the GEMV in the GEBRD algorithm to run at a speed close to the aligned speed. By padding, version 4 reaches 56 Gflop/s at size 512 which is 80% of the upper bound of the performance.
A breakdown of different components contributing to the overall time of GEBRD version 4 is depicted in Figure 6 . As the matrix size (n) increases, the time of the square matrix (in blue) begins to dominate. At a smaller size (n), the percentage of the wide/ tall matrix (asymptotically of size n b by n) to the square matrix (asymptotically of size n by n) is larger, since n b is fixed. Thus, the time spending on the wide/tall matrix (in red) is more prominent. The GEMM time is relative stable around 10% across different size problems. layer allowing users to create portable applications that run on AMD and NVIDIA GPUs with similar source code. Compared to the OpenCL C language, HIP is a C/C++ runtime API and kernel language that is very familiar to CUDA users [19] . Most GPU kernel optimizations techniques on OpenCL and CUDA equally apply on HIP programming.
AMD provides their vendor-optimized GPU BLAS, called hipBLAS [28] . Like cuBLAS, hipBLAS provides standard GEMM and GEMV, but no batched GEMV. Different than cuBLAS, hipBLAS only provides a strided batched GEMM routine, where the batched matrices are not stored in an array of pointers. Instead, a pointer is provided that points to the very first matrix, and the following matrices are located with a fixed address offset called stride.
The AMD GPU that we use to test and present our results is a Radeon R9 Fiji Nano. Fiji Nano is equipped with 4GB HBM memory. The specification of Fiji Nano vs K40c is outlined in Table 1 . Fiji Nano has much less double precision (DP) peak performance, but otherwise owns higher bandwidth. The software environment is the ROCm v1.6.3 with the HCC compiler based on Clang 6.0 [29] .
To run on AMD GPUs, we need to hipify our CUDA source code into HIP code with the assistance of the AMD provided hipify-perl script. We found that most of our A c c e p t e d M a n u s c r i p t batched GEMV and GEMM CUDA code, including device functions, and templates, can be hipified successfully except that we had to manually add an extra "hipLaunch-Parm lp" parameter to every HIP kernel.
We compare our batched GEMM against hipBLAS in three different instances: square matrix, K = 32, and K = 16, as these are of interest in the GEBRD routine. Gflop/s. This can be partially explained by the fact that the K40c GPU owns higher double precision computing capability (1.4 Tflop/s), which is about 2.8× that of the Fiji Nano (0.5 Tflop/s). However, the 2.8× cannot fully justify the fact that K40c is only 50 Gflop/s (18%) faster here. The reason is that batched GEMM is not totally compute-bound as the standard BLAS Level-3 GEMM, but a mix of compute-bound and bandwidth-bound routine. For the same byte access, the batched GEMM on small matrices has much less floating-point operations than the standard GEMM for larger matrices, as explained in [1] . When the matrix size is smaller, the performance is more bounded by bandwidth than the compute capability, which is the case for batched BLAS where matrices are much smaller than in standard BLAS. Although Fiji Nano has less computing capability, it owns higher bandwidth than the K40c, which offsets its double precision shortage.
The bandwidth advantage of the Fiji Nano is further shown in the fully memory- Fiji Nano 512 512 4
Conclusions and Future Work
The use of GPUs to accelerate scientific codes has been observed extensively on large dense and sparse linear algebra problems that feature abundant data parallelism.
Solving small linear algebra problems on current high-end many-core systems is more challenging. Still, small problems can be implemented relatively easy for multicore A c c e p t e d M a n u s c r i p t CPUs by taking advantage of the large CPU caches for data reuse. On the other hand, the development of small problems on GPUs is not as straightforward. We demonstrated that with a batched approach, small problems can be accelerated on GPUs to have a significant advantage over CPUs, as well.
We consider a batched two-sided bi-diagonalization based on the batched BLAS approach. We first adopt optimal blocking algorithm to maximize the GPU-friendly GEMM operations. We then propose device functions as the underlying components of the batched BLAS kernels. The use of device functions allows the data to be reused through shared memory and avoids multiple small kernel launches, but without demodulizing the BLAS-based LAPACK algorithm structure at the same time. The device functions are CUDA/HIP C++ templated. Auto-tuning is used to help find the optimal template setting for different types of kernels, e.g., standard, batched, or different instances for a type of kernel, like transpose, wide, slim for GEMV. Because the GEBRD algorithm iterates the sub-matrix resulting in a mis-aligned starting address, we adopt padding techniques to overcome the fluctuation. Using these techniques, we achieve 56 Gflop/s, which is 80% of the upper bound performance for the bi-diagonalization problem on a Kepler K40c GPU.
We compare our batched BLAS across different platforms and vendor-optimized threads on Sandy Bridge CPU, Nvidia cuBLAS on K40c GPU, and AMD hipBLAS on R9 Fiji Nano GPU. Our test shows that our batched GEMM implementation exceeds all of them in performance for the sizes of interest. For GPUs, we found that the HIP code on Fiji Nano is 2× faster than the CUDA code on K40c for the memory-bound GEMV routine. For batched DGEMM, Fiji Nano has inherent disadvantages in double precision computing ability, but heavily alleviated by its high bandwidth advantage.
The methods in this paper can be applied to other two-sided factorizations, e.g., the Hessenberg reduction (GEHRD) and the tri-diagonalization (SYTRD), as well.
Besides the GEMV, GEHRD and SYTRD use additional BLAS-2 TRMV (triangular matrix-vector multiplication) and SYMV (symmetric matrix-vector multiplication) operations, respectively. The same optimization techniques used for GEMV are applicable to TRMV and SYMV.
Page 26 of 29
