Driven by deep learning, there has been a surge of specialized processors for matrix multiplication, referred to as Tensor Core Units (TCUs). These TCUs are capable of performing matrix multiplications on small matrices (usually 4 × 4 or 16 × 16) to accelerate the convolutional and recurrent neural networks in deep learning workloads. In this paper we leverage NVIDIA's TCU to express both reduction and scan with matrix multiplication, and we show the benefits in terms of program simplicity, efficiency, and performance. Our algorithm exercises the NVIDIA TCUs, which would otherwise be idle, achieves 89%-98% of peak memory copy bandwidth, and is significantly faster (up to 100× for reduction and 3× for scan) than state-of-the-art methods for small segment sizes, which are common in machine learning and scientific applications. Our algorithm achieves this while decreasing power consumption by up to 22% for reduction and 16% for scan.
Introduction
Deep learning's reliance on matrix multiplication for compute has driven both research and industry to develop matrix-multiplication accelerator hardware, collectively called Tensor Core Units (TCUs) in this paper. TCUs come under the guise of different marketing terms, be it NVIDIA's Tensor Cores [18], Google's Tensor Processing Unit [10], Intel KNL's AVX extensions [76], Apple A11's Neural Engine [2], or ARM's Machine Learning Processor [3]. TCUs are designed to accelerate Multilayer Perceptrons (MLPs), Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Deep Neural Networks (DNNs) in general. TCUs vary in implementation [18, 36, 40, 43, 48, 54, 71, 74, 75, 76, 79, 87] and are prevalent [1, 4, 8, 9, 10, 11, 24, 70] in edge devices, mobile, and the cloud.
Although TCUs are prevalent and promise increases in performance and/or energy efficiency, they suffer from over-specialization, with only matrix multiplication operations being supported. This limits their applicability to general algorithms and confines them to narrowly specialized libraries and application domains. Take NVIDIA's Tensor Cores, for example. For matrix multiplication, the NVIDIA Volta architecture achieves an 8× throughput increase: each Streaming Multiprocessor (SM) is capable of performing 1024 half-precision operations per cycle using the TCUs, compared to 128 half-precision operations per cycle without them. This is enabled by the fact that NVIDIA's Volta GPUs dedicate a large portion of the SM processing unit (or subcore) chip area to TCUs, as shown in Figure 1. Currently, algorithms other than general matrix-matrix multiplication (GEMM) do not utilize the TCUs, resulting in an idle TCU and low chip utilization for these algorithms.
The objective of this paper is to expand the class of algorithms that can execute on TCUs, enabling the TCU to be used for non-GEMM kernels. We choose reduction and scan, since a large body of work [30, 31, 32, 33, 60, 68, 82, 84] has shown that they are key primitives of data-parallel implementations of radix sort, quicksort, string comparison, lexical analysis, stream compaction, polynomial evaluation, solving recurrence equations, and histograms. We formulate a simple mapping between reduction or scan and TCUs. We then develop a library of warp-, block-, and grid-level primitives for reduction and scan that utilize the NVIDIA TCUs, and present how they can be used in concert to achieve near-ideal performance. Although this paper targets GPUs, the motivation, methods, and observations are applicable to a wide number of TCU implementations.
While the formulation is the main objective, we show that our reduction and scan algorithms are either orders of magnitude faster than or rival the fastest GPU implementations, with much lower programming complexity. We also show that by leveraging the TCU we free up the general-purpose ALUs on the SMs for tasks that cannot be expressed in terms of TCU operations.
The key contributions of the paper are as follows:
1. We show how to use the TCU to compute both reduction and scan. We believe we are the first to formulate these algorithms in terms of TCU operations.
2. We measure our algorithms and show that we are orders of magnitude better than state-of-the-art algorithms for small segment sizes, which are common in mathematics (e.g. evaluating polynomials), scientific applications (e.g. finite difference), and machine learning (e.g. batch norm), and are comparable to the fastest algorithms for large segments.
3. We find that we are more energy efficient and decrease the utilization of the general-purpose ALUs, making them free for tasks that cannot be represented on TCUs.
4. We describe the current usage and programmability of the NVIDIA TCUs and evaluate GEMM on the TCUs using cuBLAS [15], CUTLASS [17], and the CUDA TCU API. The results show that TCUs can be profitably used in a much wider range of libraries.
5. We provide insights into how to relax the current CUDA TCU programming interface constraints.
This paper is divided as follows: we first describe the current NVIDIA TCUs and show the performance of GEMM and GEMV computation in Section 2. In Section 3, we give a background on reduction and scan, and then show the TCU algorithms for reduction (Section 4) and scan (Section 5). We then compare our implementation against the state of the art in Section 6. Section 7 describes related work, before we conclude in Section 8.
NVIDIA Tensor Cores
A marquee feature of NVIDIA's Volta (Tesla V100) architecture is its TCUs: programmable matrix-multiply-and-accumulate hardware units, called Tensor Cores by NVIDIA. Figure 1 illustrates the processing block within each SM, with the V100 containing 80 SMs and each SM having 4 processing blocks. In turn, each processing block contains 2 Tensor Cores, for a total of 640 Tensor Cores on the V100, achieving a 12× throughput improvement over the previous-generation Tesla P100 [22]. This section first describes the current usage of the NVIDIA TCUs, then details the current TCU API and presents some evaluation results that motivate our work.
Current Usage
Tensor Cores have so far been used only to accelerate GEMM operations, most prominently through NVIDIA's CUDA libraries. The NVIDIA libraries, such as cuBLAS [15] and cuDNN [16], require users to opt in to using the Tensor Cores, and both libraries accelerate GEMM computation (HGEMM for cuBLAS, and convolution and recurrent neural networks for cuDNN). NVIDIA also provides CUTLASS (CUDA Templates for Linear Algebra Subroutines) [17], a CUDA templated library that provides the building-block primitives to write high-performance GEMM-like kernels. Deep learning frameworks such as NVCaffe [52], Caffe2 [5], MXNet [14], PyTorch [23], TensorFlow [25], and TensorRT [19] leverage the NVIDIA libraries for DNN training [13] and inference.
Programming Interface
NVIDIA also provides a CUDA C++ Warp Matrix Multiply and Accumulate (WMMA) [7] API to program the Tensor Cores directly. The current WMMA API provides warp-level matrix operations for matrix load (load_matrix_sync), matrix store (store_matrix_sync), and matrix multiply and accumulate (mma_sync). These APIs operate on a special data type called a fragment, which holds a matrix tile in thread-local registers. A helper function to fill a matrix fragment with a scalar constant (fill_fragment) is provided as well. NVIDIA has not provided any API for explicitly calling the TCU at sub-warp level, neither in the IR nor in PTX [20, 21].
A load_matrix_sync distributes the values of the matrix across the warp lanes. Threads within a warp utilize multiple Tensor Cores concurrently to perform the mma_sync operation, collaborating to compute $D_{M \times N} = A_{M \times K} \cdot B_{K \times N} + C_{M \times N}$, with M, N, K denoting the matrix dimensions. The API imposes limitations on the dimensions, requiring the shape ⟨M, N, K⟩ to be either ⟨16, 16, 16⟩, ⟨32, 8, 16⟩, or ⟨8, 32, 16⟩.
Listing 1 shows a simple CUDA kernel that computes a 16 × 16 × 16 matrix multiplication within a warp using the WMMA API. Lines 4-6 declare the matrix fragments. The API supports 3 kinds of matrices, matrix_a (A), matrix_b (B), and accumulator (C or D), with each having its own internal data layout as well as loading, storing, and computing semantics. Users specify both the data type and the M, N, K shape of the fragments. For both the A and B matrix kinds, users specify whether the matrix is in column- or row-major order. Users specify the stride between rows using the leading dimension and load the data from either shared or global memory (Lines 7-8). Line 9 initializes the accumulator elements to zero by broadcasting the scalar value 0 into each index of the fragment. Once the data is loaded, users perform the matrix multiplication operation (Line 10) and store the results (Line 11).
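Since Listing 1 is not reproduced here, the following is a minimal sketch of such a kernel against the public WMMA API; the variable names and fragment layouts are our assumptions, and the line numbers of the paper's listing may not match exactly:

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes D = A·B + C for a single 16×16×16 tile.
__global__ void wmma_16x16x16(const half *a, const half *b, half *d) {
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;  // Lines 4-6:
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;  //   declare the
  wmma::fragment<wmma::accumulator, 16, 16, 16, half> c_frag;                //   fragments
  wmma::load_matrix_sync(a_frag, a, 16);            // Lines 7-8: load, leading dimension 16
  wmma::load_matrix_sync(b_frag, b, 16);
  wmma::fill_fragment(c_frag, __float2half(0.0f));  // Line 9: broadcast 0 into the accumulator
  wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);   // Line 10: D = A·B + C
  wmma::store_matrix_sync(d, c_frag, 16, wmma::mem_row_major);  // Line 11: store the result
}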
The kernel in Listing 1 can be generalized to implement GEMM for arbitrary matrix dimensions in a manner similar to tiled matrix multiplication. For example, a naïve implementation (referred to as WMMA HGEMM (naïve)) assigns a strip of 16 rows from matrix A and a strip of 16 columns from matrix B to each warp to compute a 16 × 16 tile of the output C. Each warp iterates through the A rows and B columns by loading 16 × 16 tiles of A and B from global memory into fragments using load_matrix_sync, performing mma_sync, and repeating. After the entire rows of A and columns of B have been consumed, the warp uses store_matrix_sync to store the accumulated C values into global memory. An optimized implementation (referred to as WMMA HGEMM) utilizes persistent threads where each thread block collaboratively loads multiple 16 × 16 tiles of matrices A and B from global memory to shared memory so that some of the tiles can be reused across warps. The tiles are then loaded into fragments and the mma_sync operation is performed.
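A sketch of the naïve scheme just described, assuming row-major inputs whose dimensions are multiples of 16 and a launch with one warp per 16 × 16 output tile (the indexing names are ours):

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Each warp owns one 16×16 tile of C and walks a strip of 16 A rows and
// 16 B columns in steps of 16 along K, accumulating into a fragment.
__global__ void wmma_hgemm_naive(const half *A, const half *B, half *C,
                                 int M, int N, int K) {
  int warpId = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
  int warpsPerRow = N / 16;
  int tileM = (warpId / warpsPerRow) * 16;  // top row of this warp's C tile
  int tileN = (warpId % warpsPerRow) * 16;  // left column of this warp's C tile
  if (tileM >= M) return;

  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, half> acc;
  wmma::fill_fragment(acc, __float2half(0.0f));

  for (int k = 0; k < K; k += 16) {
    wmma::load_matrix_sync(a_frag, A + tileM * K + k, K);
    wmma::load_matrix_sync(b_frag, B + k * N + tileN, N);
    wmma::mma_sync(acc, a_frag, b_frag, acc);
  }
  wmma::store_matrix_sync(C + tileM * N + tileN, acc, N, wmma::mem_row_major);
}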
GEMM Evaluation
We evaluate GEMM performance using Tensor Cores on a Tesla V100 PCI-E GPU with CUDA 9.2.88 through cuBLAS, CUTLASS (version 0.1.1), and hand-written kernels using the WMMA API. The results are shown in Figure 2.
For half-precision GEMM (HGEMM), shown in Figure 2a, cuBLAS HGEMM with Tensor Cores achieves a maximum performance of 96.3 TFLOPS, approximately 85% of peak performance and over 3.4× that of cuBLAS without the use of TCUs. For mixed-precision GEMM (MGEMM), shown in Figure 2b, we achieved a maximum performance of 85.8 TFLOPS on the NVIDIA TCUs using cuBLAS, approximately 76% of peak performance and about 6.2× the performance of cuBLAS without Tensor Cores (the degradation in performance compared to HGEMM is due to the output byte count being twice as large). CUTLASS MGEMM is more performant than its HGEMM; we suspect this is due to optimizations for mixed precision that are absent for half precision.
GEMV Evaluation
The order-of-magnitude speedup of GEMM with the TCU raises the question: can we formulate other algorithms in terms of matrix multiplication and also benefit from the TCU? The most obvious candidate is matrix-vector multiplication (GEMV).
We implement HGEMV (half-precision GEMV) and MGEMV (mixed-precision GEMV) with an HGEMM or MGEMM call of dimension M, N, (K = 16), since K is required to be at least 16 by the cuBLAS API. This method wastes at least 15N memory loads and performs 15MN extra flops. We evaluate our implementations against cuBLAS SGEMV, since half-precision GEMV is not present in cuBLAS. Figure 3 shows that even when accounting for both resource and computation waste, HGEMV, implemented using cuBLAS HGEMM with Tensor Cores, outperforms cuBLAS SGEMV by at least 2×. Naïve (non-optimized) HGEMV and MGEMV are superimposed atop each other, since the overhead of using mixed precision is dwarfed by inefficient memory access. Both naïve versions still outperform cuBLAS SGEMV for large input sizes.
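One plausible realization of this padding (a sketch under our assumptions, not the paper's code): pad the vector x into an N × 16 matrix X whose first column is x and whose remaining 15 columns are zero, which accounts exactly for the 15N wasted loads and 15MN extra flops quantified above. The first column of C = A · X is then y = A · x.

#include <cublas_v2.h>
#include <cuda_fp16.h>

// C(M×16) = A(M×N) · X(N×16); column 0 of C is y = A·x.
// All matrices are device-resident and column-major (cuBLAS convention).
void hgemv_via_hgemm(cublasHandle_t handle, int M, int N,
                     const __half *A, const __half *X, __half *C) {
  const __half one  = __float2half(1.0f);
  const __half zero = __float2half(0.0f);
  cublasHgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
              /*m=*/M, /*n=*/16, /*k=*/N,
              &one, A, /*lda=*/M, X, /*ldb=*/N,
              &zero, C, /*ldc=*/M);
}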
The WMMA HGEMV implementation saturates at about 900 GFLOPS in Figure 3. This is because each FMA requires one matrix element and one vector element, each 2 bytes in size. There is no reuse of matrix elements but heavy reuse of vector elements, so we need to fetch one byte from global memory into shared memory for every floating-point operation performed (an FMA counts as two operations, and each 2-byte matrix element is used once). Assuming that the vectors are mostly loaded from the L2 cache due to heavy reuse, the 900 GB/s global memory bandwidth of the V100 is the hardware limitation that prevents any GEMV implementation from exceeding 900 GFLOPS in half precision.
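In round numbers (our arithmetic, counting an FMA as two floating-point operations):
$$\frac{900\ \mathrm{GB/s}}{2\ \mathrm{bytes\ per\ matrix\ element}} \times 2\ \mathrm{flops\ per\ FMA} = 900\ \mathrm{GFLOPS}$$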
Conventional Reduction and Scan on GPU
The GEMV evaluation shows that the matrix-multiplication performance of NVIDIA TCUs is high enough to tolerate resource and computation waste in algorithms. Driven by this observation, we examine how to formulate two widely used collectives, reduction and scan, to utilize TCUs. Both reduction and scan are memory-bandwidth bound, so we expect that even with some resource and compute waste, their HGEMM-based implementations can beat or match the most highly tuned traditional implementations. Given a vector $A = [a_1, a_2, \ldots, a_n]$, the reduction, also called fold or total, of $A$ is defined by its sum $\sum_{i=1}^{n} a_i$. Segmented reduction is defined as reductions on subsets of the input vector. In a regular segmented reduction, all segments are the same size. The scan operation, also called prefix sum, is defined by the vector $[a_1,\ a_1 + a_2,\ \ldots,\ \sum_{i=1}^{n} a_i]$.
Segmented scan is defined similarly to segmented reduction.
State-of-the-art libraries, such as CUB [62], MAGMA [26], and the OpenMP runtime [38], implement both reduction and scan in terms of warp-, block-, and device-level collectives, as illustrated in Figure 4. The building blocks, warp-level reduction and scan, are commonly implemented using shuffle instructions [12]. The shuffle implementations, shown in Listing 2, allow threads within a warp to share values via registers without using shared memory or block-level synchronization.
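Listing 2 is not reproduced here; the following is the standard shuffle-based pattern it refers to, sketched for half precision (the function name is ours):

#include <cuda_fp16.h>

// Warp-level reduction via __shfl_down_sync: after five halving steps,
// lane 0 holds the sum of all 32 lanes' values.
__device__ half warp_reduce_sum(half val) {
  for (int offset = 16; offset > 0; offset /= 2)
    val = __hadd(val, __shfl_down_sync(0xffffffff, val, offset));
  return val;
}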
TCU Reduction Algorithm
Intuitively, reduction can be represented as a special case of matrix multiplication, since multiplying a row of ones with the input vector yields its sum:
$$[1\ 1\ \cdots\ 1] \cdot [a_1\ a_2\ \cdots\ a_n]^{T} = \sum_{i=1}^{n} a_i$$
The challenge is to map generic input sizes onto the fixed tensor shapes supported by the TCU. For simplicity, this paper uses the 16 × 16 matrix configuration as an example to implement the segmented reduction. Other configurations can be used in a similar manner to perform segmented reduction for multiples of 8 or 32. It is natural to extend our algorithms to support mixed-precision computation. We use Reduction_K to represent a K-regular segmented reduction: partial reductions of the input uniformly partitioned into K-element subsets. We will also use P to denote the matrix which has the first row set to 1 and the rest to 0, i.e.
$$(P)_{r,c} = \begin{cases} 1 & \text{if } r = 0 \\ 0 & \text{otherwise} \end{cases}$$
and the notation $\overline{X}$ to denote a matrix where all the elements are the constant value X. To make our formulation non-WMMA-API specific, we present our algorithms in an API-neutral way. In the following sections, we use LoadTile in place of load_matrix_sync; it accepts a matrix layout (default is row-major) and a stride (default is 16). We abstract store_matrix_sync to make it addressable as if it were a linear array. We will also use the notation A.B + C to denote the mma_sync operation.
Warp-level Reduction
We look at warp-level reduction first, since it is the building block for both block- and grid-level reductions. Even though using shuffle instructions for warp-level reduction is efficient, it can be a bottleneck due to its limited throughput: on the NVIDIA Volta architecture, only 32 warp shuffle operations can be performed per clock cycle per SM. This section shows the formulation of warp-level reduction on the TCU when the segment size is 16, 256, or a multiple of 16 or 256. Support for arbitrary segment sizes can be realized either by padding the input with zeros or by masking through the P matrix. We find that padding introduces minimal overhead and is required in some cases to maintain the memory alignment imposed by the TCU API.
Segment Size 16: The Reduction_16 algorithm, shown in Algorithm 1 and Figure 5, performs warp-level reduction on 256 elements, which represent 16 segments of size 16. The data is loaded from memory into a column-major-order fragment (matrix A). Each segment is then reduced using V = P.A. The result, the first row of V, is stored in the output.
Algorithm 1
The Reduction_16 algorithm.
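A sketch of Algorithm 1 in WMMA terms (P is precomputed in memory with ones in its first row; the function name and the 256-element output scratch are our assumptions):

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp reduces 16 segments of 16 elements each. The input is loaded
// column-major, so each 16-element segment becomes a column of A; row 0
// of V = P·A then holds the 16 segment sums.
__device__ void reduction16(const half *in, const half *P,
                            half *out /*256 halves; sums land in out[0..15]*/) {
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> p_frag;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> a_frag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, half> v_frag;
  wmma::load_matrix_sync(p_frag, P, 16);
  wmma::load_matrix_sync(a_frag, in, 16);
  wmma::fill_fragment(v_frag, __float2half(0.0f));
  wmma::mma_sync(v_frag, p_frag, a_frag, v_frag);  // V = P·A + 0
  wmma::store_matrix_sync(out, v_frag, 16, wmma::mem_row_major);
}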
Depending on the TCU API's flexibility, one can formulate the Reduction_16 algorithm as either P.A, where A is in column-major order, or A.P, where A is in row-major order. The A.P formulation is more friendly to the loading of A's elements, since the input vector elements naturally form the row-major layout for A. However, there is no performance difference when using the WMMA API, since (1) the API does not expose sub-fragment operations and (2) the CUDA compiler performs the transposition through registers. We formulate the algorithm as P.A to make it amenable to a more relaxed API, since the 16-element output is in linear row-major order and can be written to consecutive locations in global memory.
Segment Size 256: For segments of size 256, one follows a pattern similar to Reduction_16. First, all 256 elements are loaded onto the TCU. The rows are reduced using the same procedure as Reduction_16. Finally, we reduce the first row to get a scalar. The algorithm is shown in Algorithm 2 and is a single iteration of the algorithm illustrated in Figure 6.
Algorithm 2
The Reduction_256 algorithm.
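A corresponding sketch of Algorithm 2: two mma_sync operations reduce 256 elements to a scalar. Note the round trip through memory between the two multiplications, since an accumulator fragment cannot feed mma_sync as a multiplicand directly (P, the all-ones matrix, the scratch buffer, and the names are assumptions of this sketch):

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__device__ half reduction256(const half *in, const half *P,
                             const half *ones, half *scratch /*256 halves*/) {
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> ma;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> a_frag;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> ones_frag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, half> acc;

  wmma::load_matrix_sync(ma, P, 16);
  wmma::load_matrix_sync(a_frag, in, 16);
  wmma::fill_fragment(acc, __float2half(0.0f));
  wmma::mma_sync(acc, ma, a_frag, acc);            // V = P·A: row 0 holds 16 partial sums
  wmma::store_matrix_sync(scratch, acc, 16, wmma::mem_row_major);

  wmma::load_matrix_sync(ma, scratch, 16);         // reload V as a matrix_a operand
  wmma::load_matrix_sync(ones_frag, ones, 16);
  wmma::fill_fragment(acc, __float2half(0.0f));
  wmma::mma_sync(acc, ma, ones_frag, acc);         // R = V·ones: R[0][0] is the total
  wmma::store_matrix_sync(scratch, acc, 16, wmma::mem_row_major);
  return scratch[0];
}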
Segment Size Multiples of 256: With the above Reduction_16 and Reduction_256 warp-level primitives, we can express segments that are multiples of either 16 (denoted by 16N) or 256 (denoted by 256N). We will first look at the 256N algorithm, since it is used to implement the 16N algorithm.
The Reduction_256N algorithm, illustrated in Figure 6, processes each 256N-element segment in N iterations: each iteration loads a 256-element tile into A and accumulates V = P.A + V, and the first row of V is then reduced to a scalar as in Reduction_256.
Segment Size Multiples of 16: The strided Reduction_16N algorithm loads 16-element chunks, one from each of 16 segments, into the matrix A with a stride of 16N, i.e., the beginning of each 16-element chunk is 16N elements away from the beginning of the next chunk in the original input vector. The 16 columns of A are then reduced and accumulated into the first row of V. This repeats for N iterations. This method is simple, works for arbitrary multiples of 16, and leverages the GPU cache for small N. For large N, this method suffers from bad locality, consistently misses the cache, and results in poor performance.
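A sketch of the Reduction_256N loop (same assumptions as the sketches above):

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Reduce one segment of 256N elements: accumulate V = P·A + V across the
// N tiles, then collapse row 0 of V with one final multiply by ones.
__device__ half reduction256N(const half *in, int N, const half *P,
                              const half *ones, half *scratch /*256 halves*/) {
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> ma;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> a_frag;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> ones_frag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, half> v_frag;

  wmma::load_matrix_sync(ma, P, 16);
  wmma::fill_fragment(v_frag, __float2half(0.0f));
  for (int i = 0; i < N; ++i) {                    // V = P·A_i + V
    wmma::load_matrix_sync(a_frag, in + 256 * i, 16);
    wmma::mma_sync(v_frag, ma, a_frag, v_frag);
  }
  wmma::store_matrix_sync(scratch, v_frag, 16, wmma::mem_row_major);

  wmma::load_matrix_sync(ma, scratch, 16);         // V as a matrix_a operand
  wmma::load_matrix_sync(ones_frag, ones, 16);
  wmma::fill_fragment(v_frag, __float2half(0.0f));
  wmma::mma_sync(v_frag, ma, ones_frag, v_frag);   // R = V·ones: R[0][0] is the total
  wmma::store_matrix_sync(scratch, v_frag, 16, wmma::mem_row_major);
  return scratch[0];
}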
For N > 256, one can implement Reduction_16N in terms of Reduction_256N, as shown in Algorithm 3. This makes better use of cache locality and reduces uncoalesced memory accesses. The leftover, Reduction_(N%16)×16, can be implemented using the strided segmented 16N reduction method.
Algorithm 3
The Coalesced Reduction_16N algorithm.
Block-level Reduction
When the segment size is large, collaborative reduction within a block becomes profitable. We follow standard practice [62] to implement block-level reduction, but differ in that we still use the TCU to perform the reduction on the partially reduced values. Algorithm 4 shows how we use warp-level reduction to implement the block-level Reduction_256N kernel.
Grid-level Reduction
When the segment size is very large, a grid-level reduction might be needed. A naïve grid-level reduction for a list of length N involves two kernel launches: the first kernel performs a segmented reduction, with the output stored in a list of partials; a second kernel then reduces the partials into a scalar. Although it is simple to implement, it achieves performance on par with the fastest algorithms.
TCU Scan Algorithm
Unlike reduction, it might be less intuitive to represent scan as a matrix multiplication. For a vector V of 256 elements, we can store it in row-major order within a 16 × 16 matrix A, i.e. $(A)_{j,i} = v_{16(j-1)+i}$. Let U denote the upper-triangular matrix of ones (including the diagonal), so that A.U computes the inclusive prefix sum of each row of A, and let L denote the strictly lower-triangular matrix of ones (with zero diagonal).
We use the L.A matrix to create a G matrix where each element $G_{j,i}$ is the reduction of the j-th row of L.A. That is, all elements in the j-th row of G have the same value, the sum of all elements preceding the j-th row of A, i.e. $G_{j,i} = \sum_{k=1}^{j-1} \sum_{c=1}^{16} A_{k,c}$. The G matrix can be generated by multiplying L.A with the all-ones matrix $\overline{1}$. We then add G to the A.U matrix to generate the scan of V, which is read in linear row-major order. For segments spanning multiple 256-element tiles, the last element of the result (R[16, 16]) is broadcast into the S matrix and the procedure is repeated.
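A sketch of this three-multiplication formulation in WMMA terms (U, L, and the all-ones matrix are precomputed in memory; for multi-tile scans the S matrix would seed the accumulator in step 2 instead of zero; the names are our assumptions):

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Inclusive scan of 256 elements stored row-major in A:
// scan(A) = (L·A)·ones + A·U, read back in row-major order.
__device__ void scan256(const half *in, half *out, const half *U,
                        const half *L, const half *ones,
                        half *scratch /*256 halves*/) {
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> ma;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> mb;
  wmma::fragment<wmma::accumulator, 16, 16, 16, half> acc;

  // (1) LA = L·A; park it in memory so it can be reused as a multiplicand.
  wmma::load_matrix_sync(ma, L, 16);
  wmma::load_matrix_sync(mb, in, 16);
  wmma::fill_fragment(acc, __float2half(0.0f));
  wmma::mma_sync(acc, ma, mb, acc);
  wmma::store_matrix_sync(scratch, acc, 16, wmma::mem_row_major);

  // (2) acc = A·U, the row-wise inclusive scan.
  wmma::load_matrix_sync(ma, in, 16);
  wmma::load_matrix_sync(mb, U, 16);
  wmma::fill_fragment(acc, __float2half(0.0f));    // S would be loaded here instead
  wmma::mma_sync(acc, ma, mb, acc);

  // (3) acc = LA·ones + A·U, i.e. G + A·U = scan(V).
  wmma::load_matrix_sync(ma, scratch, 16);
  wmma::load_matrix_sync(mb, ones, 16);
  wmma::mma_sync(acc, ma, mb, acc);

  wmma::store_matrix_sync(out, acc, 16, wmma::mem_row_major);
}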
Warp-level Scan
With the above formulation, we follow a similar structure to Section 4, first introducing warp-level primitives before presenting the block- and grid-level primitives. We write Scan_K to represent a K-regular segmented scan. Since the process of building warp-level, block-level, and grid-level scans from Scan_K is very similar to that of reduction, we will only highlight the key differences.
Segment Size 16: is the row-scan term (A.U) alone, illustrated in steps 1, 2, and 3 of Figure 9.
Segment Size 256: is implemented using 3 matrix multiplications, shown in Figure 9 and presented symbolically above.
Segment Size Multiples of 16: is similar to the strided 16N reduction, with the difference being that we broadcast the last column rather than the reduced value, as shown in Algorithm 5.
Segment Size Multiples of 256: only a small modification to Scan_256 is needed to implement Scan_256N, as illustrated in Figure 9. Algorithm 6 shows that we keep track of the sum (the last element of the R matrix) and broadcast it to the S matrix after each iteration. The S matrix is then used when performing subsequent iterations.
Block-level Scan
Algorithm 7 shows how to perform the scan at the block level. It first computes the segmented scan using the warp primitives and stores the reduced values into a list of partials. A scan is performed on the partials list, and the values are added to the intermediate results to get the output. Algorithm 7 also exercises the TCU to perform the scan on the partially reduced values across tiles. On Line 16, we use the offset of the last row (240) and 256 as the leading dimension when loading the tile. This loads the last row of R across tiles into E. Line 17 then performs an exclusive scan on the last column of E and stores the results into the list of partials.
Grid-level Scan
Similar to reduction, the segmented scan is used as a building block for the grid-level scan. The grid-level scan uses a textbook scan-then-propagate strategy and involves 3 kernel calls. The first kernel uses segmented scan and produces partially reduced values for each block. The second kernel performs a scan on the partially reduced values. The third kernel then uniformly adds the partially scanned values to their corresponding segments. The implementation of LastColumnScan_16 loads the last-column values into the first row and performs a TCU version of the exclusive scan algorithm; formulating the intermediate operation this way is needed to adhere to the CUDA WMMA API's byte-alignment constraint for loading fragments.
Evaluation
We evaluate our implementation on an Intel Xeon E5-2698 system with CentOS 4.3, CUDA Driver 396.26, and CUDA Runtime 9.2.88 installed. We use a Tesla V100-PCIE GPU with 16 GB of GPU memory. The Tesla V100 sports HBM2 with a theoretical peak bandwidth of 900 GB/s, or 450 billion half-precision elements per second. All the results below show the throughput of the algorithms in terms of billions of half-precision elements per second.
We implemented our algorithms as a C++ header library (accessible at https://www.removed/for_blind_review). The API is designed to be similar to CUB's high-level API, providing functions such as TCU::SegmentedReduce, TCU::Reduce, TCU::SegmentedScan, and TCU::Scan. We employ auto-tuning to select the ideal algorithm, the number of warps (or independent TCU operations) per block, the coarsening factor (the number of segments processed within a warp), and the block dimensions for the user-provided segment size.
Relaxing the WMMA API Constraints
Constraints arise when using the current WMMA API for non-GEMM computation; these limitations would not exist if one were performing just GEMM computation. The constraints observed were:
1. Loads or stores must be performed at fragment granularity.
2. Loading and storing fragments can only be performed using global or shared memory; constant memory cannot be used.
3. The memory layout of the matrix kinds is not the same, making casting between matrix kinds a kluge.
We address these limitations in different ways within our implementation. For (1) and (2), using knowledge about the internal layout of the fragment [53], we implemented API enhancements tailored to our usage. Listing 3 shows our API to support extended usage of the matrix_b fragment, both for storing a constant matrix into a fragment (matrix_b_set_upper_triangular) and for loading partial fragments (matrix_b_get_first_column). Although we could use the layout information to shuffle registers to address (3), we express the cast in terms of the load/store operations available through the WMMA API. For example, to cast a matrix in the matrix_a format to the matrix_b format, we first store the matrix into shared memory and then perform a load from memory into matrix_b. These extensions can be leveraged by compiler runtimes to perform casting between fragment kinds.
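For illustration, the load/store-based cast can be sketched with the public API alone; since store_matrix_sync only accepts accumulator fragments, the publicly expressible direction is accumulator to matrix_b, with a per-warp scratch buffer assumed:

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Round-trip an accumulator fragment through shared memory to obtain a
// matrix_b fragment with the same 16×16 contents.
__device__ void cast_acc_to_matrix_b(
    const wmma::fragment<wmma::accumulator, 16, 16, 16, half> &src,
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> &dst,
    half *smem /*256 halves, private to this warp*/) {
  wmma::store_matrix_sync(smem, src, 16, wmma::mem_row_major);
  wmma::load_matrix_sync(dst, smem, 16);
}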
We observe that the API extensions using fragment layout information require less block synchronization and use less shared memory, increasing the performance of our algorithms by up to 5%. For brevity we omit these results.
Optimizing CUB for Half Precision
CUB is a C++ template library that contains multiple algorithms for the collectives. The CUB library contains the fastest [63, 66] implementations of the reduction and scan collectives and is used by libraries such as Thrust [29] as well as most deep learning frameworks [5, 14, 23, 25, 68]. We compare against the latest release of CUB [62] (version 1.8) and evaluate against different parameters of the collectives. As an optimization, CUB's warp-level shuffle reduction and scan are implemented in PTX for the integer, float, and double data types, since NVCC is currently unable to use the shuffle instruction's predicate to guard against invalid peers [12, 69]. CUB does not contain these shuffle-based optimizations for half precision. To make the evaluation fair and more technically meaningful, we implemented these optimizations for the half-precision data type in CUB. The modified CUB is used for the evaluation to provide a more aggressive baseline of comparison.
Warp-and Block-level Reduction and Scan
Theoretically, our warp-level TCU reduction algorithms require less than one fourth of the cycles of warp-level shuffle reduction. For example, consider performing a warp-level Reduction_256: the warp-level shuffle reduction shown in Listing 2 requires 8 iterations of 32-element reduction to reduce each segment. The total is therefore 256 cycles, since each shuffle instruction and addition takes 4 cycles. Our algorithm, on the other hand, performs the reduction using two matrix multiplications, or 64 cycles, since each TCU WMMA matrix multiplication requires 32 cycles. However, reduction is known to be memory bound, with ideal performance bounded by memory copy speed.
We evaluate the TCU segmented reduction algorithms against cub::DeviceSegmentedReduce::Sum by fixing the number of input elements and varying the segment size, as shown in Figure 10. When the segment size is less than 256, the 16N algorithm is used. Its performance degrades for large segment sizes because its strided access pattern results in uncoalesced global memory accesses. When the segment size is larger than 256, the 256N algorithm is used, but it again suffers performance degradation beyond segment size 2^15 due to low occupancy. When the segment size is greater than 2^15, the block-level 256N reduction is used. Our TCU implementation consistently achieves more than 90% of peak throughput across segment sizes and is always better than the CUB implementation.
When the segment size is large and the number of segments is small, the performance of both CUB and our implementation drops. Since each segment is mapped onto a block, a small number of segments causes some SMs to be idle. For example, when the segment size is 2^25, both CUB and our implementation achieve an occupancy of around 0.25 and an SM efficiency of around 40%. A better strategy for these cases would be to assign multiple thread blocks to collaboratively reduce each segment when the segment is very large. This can be achieved using CUDA 9's cooperative groups [6]. Such optimization is outside the focus of this paper, which is to propose and evaluate TCU-based reduction and scan primitives.
Our TCU implementation largely outperforms CUB's device-wide segmented reduction across segment sizes. Through profiling, we identified the major predictors of performance to be, in order of importance, the number of half-precision floating-point instructions (the inst_fp_16 NVProf metric), warp shuffle instructions (inst_inter_thread_communication), and integer instructions (inst_integer). Consistently, our implementation's half-precision instruction count is approximately equal to the total number of elements (2^31), while CUB requires a much larger number. Moreover, CUB requires a large number of integer and warp shuffle instructions, while our implementation uses no warp shuffle instructions and a smaller number of integer instructions. This contributes to the 100× performance improvement observed for segment size 16. We are able to achieve throughput within 89%-97% of the ideal throughput (the theoretical peak is 225 billion half-precision elements per second).
We examined power consumption by measuring the average power within the execution phase of the kernel using NVVP. Our implementation uses 7.4%-22.3% less power compared to CUB across different segment sizes. Again, this is because of the efficient use of the FP16 and INT ALUs as well as better SM and DRAM utilization. We note that our algorithm leaves the ALUs (other than the TCU) idle, allowing less contention on these units.
CUB provides cub::WarpReduce, applicable for segment sizes 16 and 32, to compute a parallel reduction of elements within a warp. CUB also provides cub::BlockReduce to perform reduction within a block. These two primitives require users to partition the data and construct the kernel. Since CUB's device-wide segmented reduction does not perform well for segment sizes smaller than 2^13, we evaluate our TCU implementations against the cub::WarpReduce and cub::BlockReduce implementations, shown in Figure 11. The cub::WarpReduce implementation is tunable in block size. The cub::BlockReduce implementation is tunable in block size, thread coarsening factor, and reduction algorithm. We compare our implementation to the best CUB configuration. We find that our TCU implementations are still faster for segment sizes smaller than 1024 and comparable to cub::BlockReduce in the other cases.
We evaluate the TCU segmented scan algorithms against Thrust's implementation (inclusive_scan_by_key), since CUB has no user-visible API for segmented scan. The Thrust implementation utilizes CUB's internal warp- and block-level scans to implement the scan-by-key operation. We use different segment sizes with a fixed number of input elements; the results are shown in Figure 12. Thrust, consistent with previous observations [42], has constant performance irrespective of the segment size. Our TCU scan implementations achieve more than 89% of peak throughput and are 3× faster than Thrust for small segment sizes. We also observe lower power consumption compared to Thrust: either equivalent power usage or up to 17% less. Our segmented scan is not ideal for large segment sizes since, as explained in Section 4, only a small number of blocks get launched and thus the GPU is underutilized. This can be remedied using the same strategy described for reduction. CUB provides cub::WarpScan to compute a parallel scan of items partitioned within a warp, and cub::BlockScan within a block. Similar to reduction, these two primitives require more programming effort from users to partition the data and construct the kernel. We evaluate our TCU segmented scan against the best cub::WarpScan and cub::BlockScan implementations, shown in Figure 11. The CUB scan implementations have the same tunable parameters as the CUB reduction implementations. Our TCU implementations are still largely faster for small segment sizes and at least comparable to cub::BlockScan in the other cases.
Grid-level Reduction and Scan
Unlike the warp-and block-level operations, this paper does not attempt to optimize grid-level operations -opting to use a naïve implementation for the grid-level collectives. The naïve implementation involves multiple kernel launches. We include the evaluation results to show that even our naïve gridlevel implementation achieves performance that is better or comparable to that of CUB and Thrust. We compare against both CUB and Thrust for full reduction, Figure 13 , and scan, Figure 14 . For both cases, our implementation uses the 256N block-level algorithm. Even with our naïve grid-level implementation, we are able to mach the performance of CUB and are considerably faster than the Thrust implementation. For reduction and scan, the TCU implementation is slightly faster than CUB with large input sizes being bounded by memory bandwidth and is within 98% (for reduction) of peek memory copy bandwidth.
For scan, our current unoptimized implementation internally uses CUB to scan the partially reduced values (kernel 2 described in Section 5.3). CUB, however, fails for inputs larger than 2^21, which causes our implementation to fail as well. Future optimized implementations would not use CUB.
Related Work
The mapping of some algorithms onto matrix multiplication has been well studied [49, 56, 72, 83]. Similarly, both reduction and scan are well studied from performance and application [30, 31, 32, 46, 55, 64, 78] aspects on a wide range of architectures and have been heavily evaluated across GPU architectures [39, 41, 61, 77, 85]. To our knowledge, however, there has been no attempt at expressing either reduction or scan in terms of matrix multiplication.
Considerable research has been done on the development of performance-portable reduction and scan kernels [28, 34, 57, 80]. These compilers express the algorithms as a system of alternative building blocks that are composed and auto-tuned at compile time for both the target architecture and the input characteristics. These tools are synergistic with our technique, since our algorithm can be added as another building block to implement reduction or scan.
Previous work [35, 62, 65, 73, 85] has also shown that optimizations can be made to either avoid or hide the overhead of multi-kernel launches. These optimizations would enable our grid-level operations to be competitive for large sizes when compared to state-of-the-art methods. Other research looked at specific cases of scan: in [58], the authors perform scan on tuples while minimizing global reads and facilitating latency hiding. Recently, there has been work on applying scan and reduction to optimize database queries [37, 47, 81].
Work describing NVIDIA's Tensor Cores is scarce. In [53], the authors use microbenchmarks to discern microarchitectural details of the V100 architecture. The authors of [45, 59] use half precision and Tensor Cores to implement iterative solvers: they use half precision along with low-quality solvers to compute the initial conditions and then switch to higher-precision solvers for subsequent iterations. The authors also examine the error incurred when using Tensor Cores and half precision for HPC workloads. In [27, 44, 67], the authors devise schemes to accelerate neural network training with limited loss of precision using half precision and Tensor Cores.
Conclusion
This paper leverages specialized hardware, developed to optimize matrix multiplication for deep learning, to implement both reduction and scan. This is in the same vein as early GPU research, which implemented these algorithms in terms of shaders. We point to directions for future API and architectural changes to relax some of the TCU constraints, such as loading fragments from constant memory and extracting a single row or column, resulting in simplified implementations that achieve greater energy efficiency and performance.
In this paper we showed how to map both reduction and scan onto TCUs. We believe we are the first to formulate these algorithms to exercise the TCU. Our implementation achieves up to 100× speedup on reduction and up to 3× on scan, and its performance rivals the state-of-the-art implementations in the worst cases. We observed up to 22% less power consumption for reduction and 16% for scan. We are able to make use of the otherwise idle TCUs, enabling better GPU utilization for kernels that exercise the general-purpose ALUs.
Future work will leverage the techniques described to map more algorithms and functions onto TCUs. We are specifically interested in transcendental and special functions (such as tanh and log), since the NVIDIA special function units have been observed to be a bottleneck in HPC applications. We also want to express neural network layers in terms of TCUs, using a cuDNN-like API. Some layer implementations and layer-fusion opportunities are natural extensions of our work, such as the computation of variance in batch norm [50, 51, 86] or the evaluation of special functions in activation layers.