To respond to the need of efficient training and inference of deep neural networks, a pletora of domainspecific hardware architectures have been introduced, such as Google Tensor Processing Units and NVIDIA Tensor Cores. A common feature of these architectures is a hardware circuit for efficiently computing a dense matrix multiplication of a given small size.
developed for some deep learning models, such as multilayer perceptrons, convolutional neural networks, and recurrent neural networks.
Although these architectures significantly vary in hardware specifications and architectures, they share hardware circuits for efficiently multiply small and dense matrices of fixed size. Indeed, matrix multiplication is one of the most important computational primitives in deep learning. By using a terminology introduced in [8] , we refer to all accelerators supporting hardware dense matrix multiplication as Tensor Core Units (TCUs) (or simply tensor units). By focusing on a specific computational problem, namely matrix multiplication, TCUs exhibit at the same time both high performance and low energy consumption with respect to traditional CPUs or GPUs approaches [12] .
TCUs are becoming the mainstream technology for deep learning, with constantly decreasing economic costs and a tighter integration with the main processing unit. Although TCUs were developed for domain-specific problems, it would be interesting and profitable to extend their application domain, for instance by targeting problems from linear algebra, data mining or machine learning (other than deep learning). A similar scenario appeared with the introduction of GPUs: introduced in the 2000s for accelerating computer graphics (in primis, video games), GPUs have been used since then for very different computational problems, like bioinformatics [19] , data mining [6] , and neural networks [25] . Will TCUs have the same wide impact of GPUs?.
The goals of this paper are to present a framework for designing and analyzing efficient algorithms for TCUs, and to expand the class of algorithms that exploit TCUs. We first introduce in Section 3 a computational model for tensor core units, which we call (m, )-TCU, that captures the main features of tensor units:
High performance matrix multiplication.
For a given model parameter m ≥ 1, two matri-ces of size √ m × √ m can be multiplied in O (m) time by using the (parallel) hardware circuit in tensor units.
Latency cost. The model parameter captures
the latency cost for setting up the tensor unit and preparing the input/output operands.
3. Asymmetric behavior. Tensor units can efficiently deal with a left matrix with a large number n of rows (i.e., a tall left matrix). Thus, we let the (m, )-TCU to natively multiply a n × √ m matrix for a √ m × √ m matrix, without splitting the left matrix into submatrices of size √ m× √ m. The multiplication requires O (n √ m) time and we let n ≥ m to be an user defined value.
We recall that the O (m) time for multiplying two √ m × √ m matrices is due to the hardware circuit implementation that implement a parallel systolic algorithm, and not to hardware implementations of fast matrix multiplication algorithms.
We will then use in Section 4 the (m, )-TCU model for designing and analyzing some algorithms for linear algebra problems, such as dense and sparse matrix multiplication, FFT, integer multiplication, and polynomial evaluation. Our results show that TCUs can be exploited in these computational problems, for which tensor accelerators were not initially designed. Finally in Section 5, we observe that some lower bounds on the I/O complexity in the external memory model [28] translate into lower bounds on the running time in the TCU model.
Previous results
Tensor Core Units. The literature on tensor core units has mainly focused on architectural issues, see e.g. [12, 30, 23] . Some works, like [18, 21] , have investigated the programming model and performance of deep neural networks workloads in the NVIDIA Tensor Cores.
To the best of our knowledge, the only papers that broad the class of algorithms expressible as TCU operations are [26, 8, 7] . The papers [8, 7] design algorithms for scanning and reduction that exploit NVIDIA tensor cores. In [26] , it is shown how to speed up FFT by exploiting the half precision multiplication capability of NVIDIA tensor cores. The algorithm in [26] uses the Cooley-Tukey algorithm where FFTs of size 4 are computed using tensor cores, and it is a special case of the TCU algorithm proposed in this paper in Section 4.3. However, no one of the previous works have proposed a formal computational model and investigated the computational complexities of the algorithms.
Matrix multiplication From a practical point of view, the most efficient algorithms for dense matrix multiplication are those based on the definition of matrix multiplication and which require Θ n 3/2 for multiplying two √ n × √ n matrices (see e.g. the BLAS library). Nevertheless from a theoretical point of view, several papers have been investigating algorithms requiring O n ω/2 operations for some ω < 3. The work of Strassen [27] showed that ω ≤ 2.81, and then subsequent works have been improving the upper bound on ω, up to the current best result ω < 2.3728 [29, 16] .
Related to our paper, there are some other works which tries to solve several computational problems, like triangle listing and sparse matrix multiplication, by using as a black box a fast matrix multiplication algorithm requiring time O n ω/2 for some 2 ≤ ω ≤ 3. For instance, [5] shows how to list t triangles in a (sparse) graph with m edges
time. The work [11] shows how to compute a sparse matrix multiplication in timeÕ √ nZ (ω−1)/2 + I , where Z is the number of non-zero in the input √ n × √ n matrices and Z is the number of non-zero in the output matrix. These algorithms automatically work in the (m, )-TCU by replacing the O n ω/2 running time of fast matrix multiplication in the RAM model, with the O n ω/2(m+ ) /m ω/2−1 running time in the TCU model (see Theorem 1).
Preliminaries

Technical overview on some TCUs
With the advent of deep learning as the dominant paradigm for artificial intelligence based systems, a plethora of dedicated parallel architectures have been developed for accelerating training and inference phases. We briefly describe the main characteristics of the most relevant ones: Google's Tensor Processing Unit (TPU) [12] and NVIDIA's Tensor Cores [20] (TCs). They are used to accelerate convolution layers and the related matrix multiplication operation, which represent the most computationally expensive part of a deep learning applications.
The Tensor Processing Unit (TPU) is a custom application specific integrated circuit developed for accelerating the inference phase of deep neural networks. It is built around 65536 8-bit ALU Matrix Multiply Unit (MMU) and on-chip memory. TPU has been designed as an accelerator to plug into on traditional server as GPUs do through PCIe I/O bus. On the contrary to GPUs, the processor sends the instruction buffer instructions by avoiding the use of a fetch unit inside the TPU. Data are sent from the CPU host memory to the local TPU memories, named Unified Buffer and Weight Memory, with the goal of offloading all the computation. In combination with a systolic execution model which allows data prefetching from the memory to MMU, this design allows to reduce the overhead and maximize the throughput. Specifically, MMU is composed by 256×256 8-bit multiplier-accumulator units. The 16-bit products are stored in 32-bit Accumulators. The MMU performs 256-element partial sum per clock cycle. Briefly, the typical TPU workflow is summarized as follows:
1. Read Host Memory reads data from the CPU host memory into the Unified Buffer memory.
2. Read Weights reads elements from Weight Memory into the Weight FIFO to fill in the MMU.
3. MatrixMultiply performs a matrix multiplication. This operation takes a variable-sized n × 256 input, with n < 96K, multiplies it by a 256 × 256 constant elements, and produces a n×256 output, taking about n cycles to complete.
4.
Write Host Memory writes data from the TPU (Unit Buffer) to the CPU host memory.
In NVIDIA "Volta" architecture, Tensor Cores extend the traditional GPU architectures and the parallel programming interface (CUDA) by providing dedicated units to efficiently perform dense-matrix by dense-matrix multiplication. The Volta microarchitecture revised NVIDIA Streaming Multiprocessors (SM) design: the SM consists of 4 processing blocks. Each block contains: 2 Tensor Cores, 8 Floating Point (FP) unit operating at 64-bit, 16 FP operating at 32-bit, 8 Integer Unit operating at 32-bit and one Special Function Unit. Concerning the memory hierarchy, the L1 cache and the shared memory are located in same in-chip surface. The L2 is also included in the same die and it is accessed among multiple SMs. The High Bandwidth Memory (HBM) can be addressed by a 4096-bit memory interface. Tensor Cores allows to perform 64 floating-point Fused-MultiplyAdd (FMA) operations in one cycle. FMA operates in half-precision (16-bit) and optionally stores a 32-bit output by using a sum accumulator. Such design allows to deliver a peak of 31.4 Tera Floating Point Operations Per Second (FLOPS) with half precision on a Tesla V100 (640 Tensor Cores spread over 80 SM). From the programming point of view, although TC basically performs one matrix-multiply and accumulate operation on 4 × 4 matrices, it is possible to multiply a 16 × 16 matrices at programming level with one CUDA warp (32 threads). This requires primitives for data loading and the synchronization of the result from/to registers through load and store units. On the contrary to TPUs, which also provide specific units for accelerating other neural networks operations (e.g., Activation Unit), TCs can be used for a generic matrix multiplication.
Systolic algorithms for matrix multiplication
The circuits which implement matrix multiplication in the Google TPU and in the NVIDIA TC adopt a systolic algorithm for matrix multiplication. A systolic algorithm is an algorithm for a systolic array, that is a network of processing elements (PEs) that rhythmically compute and pass data through the system [17] . For the sake of completeness, we formalize the systolic algorithm implemented in the Google TPU [12] . The implementation on NVIDIA TCs is slightly different but shares the same high level structure, and we refer to [17] for a more complete overview on systolic algorithms. The systolic algorithm is implemented on a 2-dimensional array of m PEs, and we denote with p i,j the PE at row i and column j, for each 0 ≤ i, j < √ m. Let A and B be the two √ m × √ m input matrices, and let C = A · B be the √ m × √ m output matrix; we denote with a i,j , b i,j , c i,j the entry in row i and column j of A, B, C respectively; for notational simplicity, we let a i,j = 0 if i, j < 0 or i, j ≥ √ m. The algorithm works as follows (see also Figure 1 ):
• In the first √ m steps, matrix B is pushed within the m PEs so that p i,j contains b i,j .
• The algorithm then executes 2 √ m steps. In each step k, with 0 ≤ k < 2 √ m, each PE p i,j receives: 1) an entry a of A from the left PE p i,j−1 or the input a k−i,i if j = 0; 2) a partial sum c from the top PE p i−1,j , or we set c = 0 if i = 0. Then, each p i,j computes c = c+a·b i,j (recall that b i,j is in the local memory of p i,j . Finally p i,j forwards a to the right PE (p i,j+1 , if any) and c to the bottom PE (p i+1,j or it is output if i = √ n − 1).
• We observe that each p √ m−1,j outputs c i,j at the end of step k = √ m + i + j.
We observe that the algorithm can be extended to compute C = A · B where A is a n × √ m matrix and B is a √ m × √ m matrix, by just continuing pumping all rows within the system. This feature is not available in the NVIDIA implementation since matrix B does not reside in the local PE memories, but it is percolated within the array as matrix A.
The (m, )-TCU model
We propose a computational model for tensor core units that captures the following properties:
• Matrix acceleration. The hardware circuits implement a parallel algorithm for multiplying two matrices of a fixed size, and the main cost is dominated by reading/writing the input and output matrices. For a given hardware parameter m, we have that the multiplication of two matrices A and B of size
With time, we mean the running time as seen by the CPU clock and it should not be confused with the total number of operations executed by the unit, which is always O m 3/2 . Indeed, no existing tensor unit implements fast matrix multiplication algorithms, as for instance Strassen. The matrix multiplication operation is called by an instruction specifying the address (in memory) of the two input matrices and of the output matrix where the result will be stored; data will be loaded/stored by the tensor unit.
• Latency cost. A call to the tensor unit has a latency cost. As the state of the art tensor units use systolic algorithms, the first output entry is computed after Ω ( √ m) time. There are also initial costs associated with activation, which can significantly increase when the unit is not connected to the CPU by the internal system bus or is shared with other CPUs. We thus assume that the cost of the multiplication of two matrices of size
, where > 0 is the latency cost.
• Asymmetric behavior. As tensor units are designed for improving training and inference in deep networks, the two matrices in the multiplication A × B are managed differently. Matrix B represents the model which we are currently training/using, while rows in matrix A represent the vectors of √ m entries to be evaluated. As the same model can be applied to k vectors, with n >> √ m, it is possible to first load the weights in B and then to stream the n rows of A into the tensor unit (possible in chunks of √ m rows), reducing thus latency costs. Thus, we assume in our model that two matrices of size n × √ m and √ m× √ m can be multiplied in time O (n √ m + ), where n ≥ √ m is the number of rows and it is a value defined by the programmer.
Following the previous observations, we define the Tensor Computing Unit (TCU) model as follows. The (m, )-TCU model is a standard RAM model where the CPU contains a circuit, named tensor unit, for performing a matrix multiplication A × B of size n × √ m and
, where m ≥ 1 and ≥ 0 are two model parameters and n ≥ √ m is a value (possibly input dependent) specified by the programmer. The matrix operation is initialized by a (constant size) instruction containing the addresses in memory of the two input matrices A and B, of the output matrix C, and the row number n of A. The running time (or simply time) of a TCU algorithm is given by the total cost of all operations performed by the CPU, including calls to the tensor units. We assume no concurrency between tensor unit, memory and CPU, and hence at most one component is active at any time. We also assume that each CPU instruction works with operands of one memory word, and that each entry of a matrix in the TCU requires one memory word. Each memory word consists of κ bits (in general, we denote κ = Ω (log n) where n is the input size, that is enough for storing the input size in one word).
Discussion on the model
As our goal is to understand to which extent algorithms can exploit circuits of fixed size for matrix multiplication, we avoided the modeling of parallel tensor units and the limited numerical precision of tensor units in this work. We also do not directly consider the bandwidth between CPU and TCUs, assuming that one memory word can be transferred in the tensor unit in O (1) CPU cycles. We now make some considerations on how Google TPUs and NVIDIA TCs fit our model.
In the Google TPU (in the version described in [12] ), the right matrix B has size 256 × 256 words (i.e., m = 65536). The left matrix A is stored in the local unified buffer of 96k × 256 words; thus, TPUs can compute the product between two matrices of size 96k×256 and 256 ×256 in one (tensor) operation. The number of rows of the left matrix in the TCU model is an user defined parameter (potentially a function of the input size); on the other hand, the number of rows of the left matrix in the TPU is user defined but it is upper bounded by a hardware-dependent value (i.e., 96K). Being this bound quite large, a TPU better exploits a tall left matrix than a short one, as in the TCU model. The systolic array works in low precision with 8 bits per word (κ = 8). Although the bandwidth was limited in the first TPU version (16GB/s), it has significantly increased in more recent versions (up to 600 GB/s). Although TPU has a quick response time, the overall latency is high because the right hand matrix has to be suitably encoded via a TensorFlow function before loading it within the TPU: in fact, the programming model in TPUs is strongly integrated with TensorFlow, and it does not allow to use bare matrices as inputs. The high latency cost might mitigate the fact that our model does not capture limited bandwidth.
The programming model of the NVIDIA Volta architecture allows to multiply matrices of size 16 × 16, although the basic hardware unit works on 4 × 4 matrices; we thus have m = 256. Memory words in systolic array have, as in the TPU, a limited precision (κ = 16 bits). TCs exhibit high bandwidth and low latency, as data are provided by an high bandwidth memory shared with the GPU processing units. Furthermore, both matrices A and B can be loaded within TCs without a special encoding as in TPUs, since the NVIDIA Volta natively provides support for matrix multiplication. Finally we observe that, as TCs are within a GPU, any algorithm for TCs has also to take into account GPU computational bottlenecks (see e.g. [14, 1] ).
Algorithms
Dense matrix multiplication
A Strassen-like algorithm for matrix multiplication was defined in [4] as a recursive algorithm that utilizes as base case an algorithm A for multiplying two √ n 0 × √ n 0 matrices using p 0 element multiplications and O (n 0 ) other operations (i.e., additions and subtractions); we assume n 0 = O (p 0 ). Given two √ n × √ n matrices with n > n 0 , a Strassen-like algorithm envisions the two √ n × √ n matrices as two matrices of size √ n 0 × √ n 0 where each entry is a submatrix of size n/n 0 × n/n 0 : then, the algorithm recursively computes p 0 matrix multiplications on the submatrices (i.e., the p 0 element multiplications in A) and then performs O (n) other operations. For given parameters p 0 and n 0 , the running time of the algorithm is T (n) = O (n ω0 ), where 1 ω 0 = log n0 p 0 . By setting n 0 = 4 and p 0 = 8, we get the standard matrix multiplication algorithm (ω 0 = 3/2), while with n 0 = 4 and p 0 = 7 we get the Strassen algorithm (ω 0 = log 4 7 ∼ 1.403). Any fast matrix multiplication algorithm can be converted into a Strassen-like algorithm [22] .
The TCU model can be exploited in Strassen-like algorithms by ending the recursion as soon as a subproblem fits the tensor unit: when n ≤ m, the two input matrices are loaded in the tensor unit and the multiplication is computed in O (m) time. For simplicity, we assume m ≥ n 0 (otherwise the tensor unit would not be used) and we get the following result. Theorem 1. Given a Strassen-like algorithm with parameters n 0 and p 0 , then there exists a TCU algorithm that multiplies two n × n matrices on an (m, )-TCU model, with m ≥ n 0 , in time
Proof. The running time is given by the following simple recursion: We now show how to decrease the latency term, i.e., (n/m) 3/2 , in the TCU algorithm based on the standard algorithm. The idea is to keep as much as possible the right matrix B within the tensor unit by using a tall left matrix A. We split the left matrix A into n/m blocks A i of size √ n × √ m (i.e., vertical strips of width √ m) and the right matrix B into square blocks B i,j of size √ m × √ m, with 0 ≤ i, j < n/m. Then, we compute C i,j = A i · B i,j for each 0 ≤ i, j < n/m using the tensor unit in time O (n √ m + ). The final matrix C follows by computing the √ n× √ m
Theorem 2. There exists an algorithm that multiplies two n × n matrices in the (m, )-TCU model in time
If we require all elementary products to be performed in the tensor unit and to use only semiring operations, the above algorithm is optimal.
Proof. The upper bound follows since each multiplication C i,j = A i · B i,j requires O (n √ m + ) time, and there are n/m such terms. The cost of the final summation in negligible.
When using only semiring operations, any algorithm must computes n 3/2 elementary products. Since each call to a tensor computes m 3/2 elementary products in Θ (m) time using a systolic algorithm, we need Ω n 3/2 /m 1/2 time. Furthermore, since all entries of B must be loaded in the tensor unit at least once and we cannot load more than m entries in B per tensor operation, the algorithm has to load at least n/m distinct right matrices in the tensor unit; then a Ω ( n/m) lower bound on the time also follows.
Sparse matrix multiplication
A TCU algorithm for multiply two sparse matrices follows from the work [11] that uses as a black box a fast matrix multiplication algorithm for multiplying two √ n × √ n matrices in O n ω/2 time. Let I be the number of non-zero entries in the input matrices A and B, and let Z be the number of non-zero entries in the output C = A · B. We consider here the case where the output is balance, that is there are Θ (Z/ √ n) non-zero entries per row or column in C; the more general case where nonzero entries are not balanced is also studied in [11] and can be adapted to TCU with a similar argument. The algorithm in [11] computes the output in timẽ O √ nZ (ω−1)/2 + I with high probability. The idea is to compress the rows of A and the column of B from √ n to √ Z using a hash function or another compression algorithm able to build a re-ordering of the matrix A.Then the algorithm computes a dense matrix product between a √ Z × √ n matrix and a √ n × √ Z using the fast matrix multiplication algorithm. Other data structures and techniques are required in the paper and we refer to [11] for more details; however, they do not exploit TCU and do not affect the running time. By replacing the fast matrix multiplication with the result of Theorem 1, we get the following. Proof. The claim follows by observing that a multiplication between a matrix of size √ Z × √ n and a matrix of size √ n × √ Z, can be decomposed into n/Z matrices of size √ Z × √ Z, and each one can be solved with the algorithm of Theorem 1.
Discrete Fourier Transform
The Discrete Fourier Transform y of a n-dimensional (column) vector x can be defined as the matrix-vector product y = x T · W , where W is the Fourier matrix (or DFT matrix) and T denotes the transpose of a matrix/vector. The Fourier matrix W is a symmetric n × n matrix where the entry at row r and column c is defined as: W r,c = e −(2πi/n)rc . The Cooley-Tukey algorithm is an efficient and recursive algorithm for computing the DFT of a vector. The algorithm arranges x as an n 1 × n 2 matrix X (in row-major order) where n = n 1 · n 2 ; each column X * ,c is replaced with its DFT and then each entry X r,c is multiplied by the twiddle factor w rc n ; finally, each row X r, * is replaced by its DFT and the DFT of x is given by reading the final matrix X in column major.
For simplicity, we assume that the TCU model can perform operations (e.g., addition, products) on complex numbers; this assumption can be easily removed with a constant slow down in the running time: for instance, the multiplication between √ m × √ m complex matrices can be computed with four matrix multiplications and two sums of real values.
To compute the DFT of x using a (m, )-TCU, we use the Cooley-Tukey algorithm where we set n 1 = √ m and n 2 = n/ √ m (we assume all values to be integers). Then, we use the tensor unit for computing the n 2 DFTs of size n 1 = √ m by computing X T ·W √ m . Then, we multiply each element in X by its twiddle factor and transpose X. Finally, we compute the n 1 DFTs of size n 2 : if n 2 > √ m, the DFTs are recursively computed; otherwise, if n 2 ≤ √ m, the n 1 DFTs are computed with the multiplication X T · W n2 by using the tensor unit. 
Proof.
In each recursive level, we load matrix B = W √ m at the beginning and then compute the n 2 DFTs of n 1 entries by multiplying a n/ √ m × √ m matrix with W √ m in time O (n + ). The remaining operations (i.e., matrix transposition and multiplication by twiddle factors) cost O (n) time. The running time is thus given by the next recurrence, from which follows the statement.
We observe that we use as base case n ≤ m and not n ≤ √ m: indeed, when n ≤ m we use the tensor core for computing the n 2 DFTs of size n 1 = √ m, but also the n 1 DFTs of size n 2 ≤ √ m. This analysis gives a tighter upper bound than using n ≤ √ m as base case.
We observe that the above algorithm generalizes the approach used in [26] on an NVIDIA Verdi architecture. The paper decomposes the vector using n 1 = 4 and n 2 = n/4 and then solves subproblems of size 4 using a tensor core, since a TC can multiply 4 matrices. However, the algorithm in [26] does not seem to exploit batch DFTs (i.e., n/ √ m DFTs in one tensor operation) as in our algorithm.
Integer multiplication
We now study how to multiply two long integers by exploiting a (m, )-TCU. The input is given by two integers a and b of n bits each (without loss of generality, we assume both integers to be positives and n > m), and the output is the binary representation of c = a * b, of size 2n − 1. For this problem, we introduce in the design a third parameter κ, which is the bit length of a memory word that can be managed by a TCU model. We assume that κ = Ω (log n), that is there are enough bits in a word to store the input/output size. It is easy to see that the tensor unit can multiply matrices of (positive) integers of κ = κ/4 bits without overflow: the largest integer in the output matrix using κ bits is 2 2κ √ m which requires 2κ +log √ m < κ (if n >> m, then κ = κ/2−1 suffices).
We initially show how to speed up the long integer multiplication algorithm [15] , also known as the schoolbook algorithm, by exploiting the tensor unit. Then, we will use this algorithm to improve the recursive Karatsuba algorithm [13] .
Let A(x) be the following polynomial, with coefficients from N:
where n = n/κ and A i = (a (i+1)κ −1 . . . a iκ ) 2 is the integer given by the ith segment of κ bits of a. Let B(x) be defined similarly for b. We have that a = A(2 κ ) and b = B(2 κ ). We define C(x) = A(x) · B(x) and we observe that c is given by evaluating C(2 κ ). Note that A(X) and B(X) have degree n − 1, while c has degree at most (2n − 1)/κ ≤ 2n − 1.
The coefficients of C can be evaluated with the matrix multiplication C = A · B where: 1) B is the column vector with the n coefficients of B(X); 2) A is a (2n − 1) × n matrix where A i,j = A n −i+j−1 and we assume that A h = 0 if h < 0 or h ≥ n .
The product C = A · B cannot exploit TCU since B is a vector. To fully exploit an (m, )-TCU, we show how to calculate C coefficients via the multiplication
• Matrix B follows by considering vector B as the column major representation of a
• Matrix A is given be considering all segments Theorem 5. Two integers of n bits can be multiplied in a (m, )-TCU with κ bits operations in time
Proof. Let C h be the coefficient of C(x) of x h indeterminate. We have:
where as usual a i = 0 and b i = 0 if i < 0 or i ≥ n . By definition of matrices A and B , we can rewrite the previous equation as:
We observe that the last sum is including all entries in C i,j where h = 2(n − 1) − i − j √ m, as required in the algorithm. The algorithm then correctly computes all C h coefficients and hence c = a · b.
We now consider the running time. The cost of the matrix multiplication C = A · B is O Theorem 6. Two integers of n bits can be multiplied in a (m, )-TCU with κ bits operations in time
Proof. The running time follows by the recurrence:
Batch polynomial evaluation
We now show how to exploit the TCU for evaluating a given polynomial of A(x) = n−1 i=0 a i x i of degree n − 1 on p points p i , with 0 ≤ i < p. For simplicity we assume n to be a multiple of √ m, p ≥ √ m, and that the polynomial can be evaluated without overflow on the memory word available in the TCU model.
We initially compute for each p i the powers p
. We define the following matrices:
• A matrix X of size p × √ m where the ith row is
Stated differently, we consider the sequence a 0 , . . . , a n−1 as the column major representation of A.
We then compute the product C = X · A by exploiting the tensor unit. As in the previous section, we decompose A into √ m × √ m submatrices and then solve n/m multiplications. Then, for each p i , the values A(p i ) follows by the sum n/ √ m−1 j=0
Theorem 7. A polynomial of degree n − 1 can be evaluated on p points on a (m, )-TCU with κ bits operations in time
Proof. The correctness follows by observing that:
The cost of the initial exponentiation and of the final evaluation is O (p( √ m + n/ √ m)). The cost of computing C is given by invoking the tensor unit n/m times on two matrices of size p × √ m and
Relation with the external memory model
In this section we highlight a relation between the external memory model and the TCU model. We recall that the external memory model (also named I/O model, and almost equivalent to the ideal cache model) is a model capturing the memory hierarchy and consists of an external memory of potential unbounded size, of an internal memory of M ≥ 1 words, and a processor. The processor can only perform operations with data in the internal memory, and moves (input/output) blocks of B ≥ 1 words between the external memory and the internal memory. The I/O complexity of an algorithm for the external memory model is simply the number of blocks moved between the two memories. We refer to the excellent survey in [28] for a more exhaustive explanation.
The upper bounds derived for some of the previous problems recall the I/O complexity in the external memory model. For instance, the cost of dense matrix multiplication with only semiring operations (Theorem 2) is O n 3/2 / √ m when = O (1), while the I/O complexity for the same problem in the external memory model is O n 3/2 / √ m when B = O (1) [28] .
We observe that the product between two matrices of size √ m × √ m requires O (m) I/Os to load and storing the input in a memory of size M = 3m. Therefore any call to the tensor unit in a TCU can be simulated in the external memory of size M = 3m with Θ (m) I/Os. We now exploit this fact by providing a theorem showing that a lower bound in the external memory model translates into a lower bound in a weaker version of the TCU model. In the weak TCU model, the tensor unit can only multiply matrices of size √ m × √ m (i.e., we cannot exploit tall left matrices). We observe that any algorithm for the original TCU model can be simulated in the weak version with a constant slowdown when = O (m): indeed, the multiplication between an n × √ m matrix with a √ m× √ m can be executed in the weaker model by splitting the n × √ m matrix into n/ √ m matrices of size √ m × √ m and then performing n/ √ m matrix multiplication with total time O (n √ m).
Theorem 8. Consider a computational problem P with a lower bound F P on the I/O complexity in an external memory with memory size M = 3m + O(1) and block length B = 1. Then, any algorithm for P in the weak TCU model has Ω (F P ) time.
Proof. Consider an algorithm for the weak (m, )-TCU, and let T = T t + T o be the total running time with T = o (F P ): we denote with T t the running time due to tensor units, and with T o all the remaining operations. We can simulate the algorithm on an external memory with M = 3m + O(1) as follows. All standard operations are simulated using O(1) internal memory and incurring O (T o ) I/Os. Each call to the tensor unit can be simulated in the external memory by loading the two input matrices in the internal memory with O (m) I/Os, computing the product with no I/Os, and then writing the result in external memory with O (m) I/Os. If we have k calls to the tensor unit, the algorithm requires km = O (T t ) I/Os (recall that each call requires Θ (m) time). Thus the TCU algorithm gives an external memory algorithm with I/O complexity O (T t + T o ) = o (F P ), which is a contradiction. Therefore, we must have T = Ω (F P ).
Conclusion
In this paper, we have introduced the first computational model for tensor core units, namely the (m, )-TCU model. We have used this model for designing and analyzing several algorithms from linear algebra, broadening the class of algorithms that benefit of a fast hardware circuit for matrix multiplication. The paper leaves several open questions:
• The TCU model should be experimentally validated. Do experimental performance behave as predicted in the theoretical model? Can we use the TCU model for effectively model Google TPUs and NVIDIA TCs?
• Which other computational problems can benefit of a tensor unit? Can existing algorithms for deep learning on tensor cores be further improved?
• Hardware accelerators have parallel tensors and low numerical precision. How can we include these features in the TPU model, and to which extent do they affect TPU algorithm design?
