A Computational Model for Tensor Core Units by Chowdhury, Rezaul et al.
A Computational Model for Tensor Core Units
Rezaul Chowdhury1, Francesco Silvestri2, and Flavio Vella3
1Stony Brook University, USA, rezaul@cs.stonybrook.edu
2University of Padova, Italy, silvestri@dei.unipd.it
3Free University of Bozen, Italy, flavio.vella@unibz.it
Abstract
To respond to the need of efficient training and inference of deep neural networks, a plethora of
domain-specific hardware architectures have been introduced, such as Google Tensor Processing
Units and NVIDIA Tensor Cores. A common feature of these architectures is a hardware circuit
for efficiently computing a dense matrix multiplication of a given small size.
In order to broaden the class of algorithms that exploit these systems, we propose a computa-
tional model, named the TCU model, that captures the ability to natively multiply small matrices.
We then use the TCU model for designing fast algorithms for several problems, including matrix
operations (dense and sparse multiplication, Gaussian Elimination), graph algorithms (transitive
closure, all pairs shortest distances), Discrete Fourier Transform, stencil computations, integer
multiplication, and polynomial evaluation. We finally highlight a relation between the TCU
model and the external memory model.
1 Introduction
Deep neural networks are nowadays used in several application domains where big data are available,
and have led to breakthroughs, such as reducing word error rates in speech recognition by 30% over
traditional approaches [10] and cutting the error rate in an image recognition competition from
26% to 3.5% [11]. The huge data set size, although crucial for improving neural network quality,
rises performance issues during the training and inference steps. To respond to the increasing
computational needs, domain-specific hardware accelerators have been introduced by several IT
firms, such as Google Tensor Processing Units [13], NVIDIA Tensor Cores [21], Intel KNL’s AVX
extensions [25], Apple Neural Engine [2], and ARM’s Machine Learning Processor [3] among the
others. These compute units have been specifically developed for some deep learning models, such
as multilayer perceptrons, convolutional neural networks, and recurrent neural networks.
These accelerators significantly vary in hardware architectures; however, they share circuits to
efficiently multiply small and dense matrices of fixed size. Indeed, matrix multiplication is one of
the most important computational primitives in deep learning. By using the terminology introduced
in [9], we refer to all accelerators supporting hardware dense matrix multiplication as Tensor Core
Units (TCUs) (or simply tensor units). By focusing on a specific computational problem, namely
matrix multiplication, TCUs exhibit at the same time both high performance and low energy
consumption with respect to traditional CPU or GPU approaches [13].
TCUs are becoming the mainstream technology for deep learning, with constantly decreasing
economic costs and a tighter integration with the main processing unit. Although TCUs were
developed for domain-specific problems, it would be interesting and profitable to extend their
1
ar
X
iv
:1
90
8.
06
64
9v
2 
 [c
s.D
S]
  9
 Ju
l 2
02
0
application domain, for instance by targeting problems from linear algebra, data mining or machine
learning (other than deep learning). A similar scenario appeared with the introduction of GPUs:
introduced in the 2000s for accelerating computer graphics (in primis, video games), GPUs have been
used since then for very different computational problems, like bioinformatics [20], data mining [6],
and neural networks [27]. Will TCUs have the same wide impact of GPUs?
The goals of this paper are to present a framework for designing and analyzing efficient algorithms
for TCUs, and to expand the class of algorithms that exploit TCUs. We first introduce in Section 3
a computational model for tensor core units, which we call (m, `)-TCU, that captures the main
features of tensor units:
1. High performance matrix multiplication. For a given model parameter m ≥ 1, two
matrices of size
√
m×√m can be multiplied in O (m) time by using the hardware circuit in
tensor units.
2. Latency cost. The model parameter ` captures the latency cost for setting up the tensor
unit and preparing the input/output operands.
3. Asymmetric behavior. Some tensor units can efficiently process a left matrix with a large
number n of rows (i.e., a tall left matrix). Thus, we let the (m, `)-TCU to natively multiply
an n×√m matrix by a √m×√m matrix, without splitting the left matrix into submatrices
of size
√
m×√m. The multiplication requires O (n√m) time and we let n ≥ √m to be a user
defined value.
In Section 4, we design several algorithms that exploit tensor accelerators and analyze their
performance on the (m, `)-TCU model. More specifically, we show how to compute some matrix
operations (dense and sparse multiplication, Gaussian Elimination), graph problems (transitive
closure, all pairs shortest distances), the Discrete Fourier Transform, a class of stencil computations,
integer multiplication, and polynomial evaluation. These algorithms give evidence that TCUs can be
potentially used for different computational problems, in addition to deep neural networks. Finally
in Section 5, we observe that some lower bounds on the I/O complexity in the external memory
model [30] translate into lower bounds on the running time in the TCU model.
We observe that, from a theoretical point of view, TCUs will not be needed if it is developed an
algorithm for multiplying two
√
n×√n matrices in O (n) time (i.e., ω = 2). However, from a more
realistic point of view, it is very unlikely that such algorithm will have experimental performance
equivalent to the state of the art, in particular of hardware implementations. A rigorous approach to
TCUs is an important step to fully exploit tensor accelerators and to further improve the performance
of algorithms, but also to better understand the generality of matrix multiplication as computational
primitive.
1.1 Previous results
Tensor Core Units. The literature on tensor core units has mainly focused on architectural
issues, see e.g. [13, 32, 24]. Some works, like [19, 22], have investigated the programming model and
performance of deep neural networks workloads in the NVIDIA Tensor Cores.
To the best of our knowledge, the only papers that broaden the class of algorithms expressible as
TCU operations are [9, 7, 28]. The papers [9, 7] design algorithms for scanning and reduction that
exploit NVIDIA tensor cores. In [28], it is shown how to speed up the Discrete Fourier Transform
(DFT) by exploiting the half precision multiplication capability of NVIDIA tensor cores. The
algorithm in [28] uses the Cooley-Tukey algorithm where DFTs of size 4 are computed using tensor
cores, and it is a special case of the TCU algorithm proposed in this paper in Section 4.5. However,
2
none of the previous works has proposed a rigorous method for studying how to accelerate algorithms
with TCUs.
Matrix multiplication From a practical point of view, the most efficient algorithms for dense
matrix multiplication are those based on the definition of matrix multiplication and which require
Θ
(
n3/2
)
operations for multiplying two
√
n×√n matrices (see e.g. the BLAS library). Nevertheless
from a theoretical point of view, several papers have been investigating algorithms requiring O
(
nω/2
)
operations for some ω < 3. The work of Strassen [29] showed that ω ≤ 2.81, and then subsequent
works have been improving the upper bound on ω, up to the current best result ω < 2.3728 [31, 17].
Some works, like [5, 12], investigate how to use fast matrix multiplication to compute problems
like triangle listing and sparse matrix multiplication. The results in [5] show how to list t triangles
in a graph with m edges in O
(
m2ω/(ω+1) +m3(ω−1)/(ω+1)t(3−ω)/(ω+1)
)
time. In [12], it is shown how
to compute a sparse matrix multiplication in time O˜
(√
nZ(ω−1)/2 + I
)
, where I is the number of
non-zeros in the input
√
n×√n matrices and Z is the number of non-zero in the output matrix.
These algorithms automatically work in the (m, `)-TCU model by replacing the O
(
nω/2
)
running
time of fast matrix multiplication in the RAM model, with the O
(
nω/2(m+`)/mω/2−1
)
running time
in the TCU model (see Theorem 1).
2 Preliminaries
2.1 Technical overview on some TCUs
We now briefly describe the main characteristics of the most relevant hardware accelerators for
deep learning: Google’s Tensor Processing Unit (TPU) [13] and NVIDIA’s Tensor Cores [21] (TCs).
They are used to accelerate convolution layers and the related matrix multiplication operations,
which represent the most computationally expensive part of deep learning applications.
The Tensor Processing Unit (TPU) is an application-specific integrated circuit developed for
accelerating the inference phase of deep neural networks. A TPU consists of an ALU Matrix Multiply
Unit (MMU) and two on-chip memories, called Unified Buffer and Weight Memory. TPU has been
designed as an accelerator to plug into a traditional server as GPUs do through PCIe I/O bus. Data
are sent from the CPU host memory to the local TPU memories with the goal of offloading all the
computation. The MMUs are composed by 256 × 256 8-bit multiplier-accumulator units, where
the 16-bit products are stored in 32-bit accumulators. A systolic execution of the units reduces
the overhead and maximizes the throughput, with 256-element partial sum per clock cycle. Briefly,
the typical TPU workflow is summarized as follows: (1) Read a variable-sized n× 256 input, with
n < 96K from the CPU host memory into the Unified Buffer memory; (2) Read a 256× 256 matrix
into Weight Memory and then into the MMUs; (3) Perform a matrix multiplication; (4) Write the
n× 256 output from the TPU (Unified Buffer) to the CPU host memory. We observe that, using
our notation, step 1 reads the left matrix A, while step 2 reads matrix B.
In the NVIDIA ”Volta” architecture, Tensor Cores extend the traditional GPU architectures and
the parallel programming interface (CUDA) by providing dedicated units to efficiently perform dense
matrix multiplication. The Volta micro-architecture revised NVIDIA Streaming Multiprocessors
(SM) design: the SM consists of 4 processing blocks. Each block contains: 2 Tensor Cores, 8
Floating Point (FP) units operating at 64-bit, 16 FP operating at 32-bit, 8 Integer Unit operating
at 32-bit and one Special Function Unit. Concerning the memory hierarchy, the L1 cache and the
shared memory are located in the same in-chip surface. The L2 is also included in the same die
and it is accessed among multiple SMs. The High Bandwidth Memory (HBM) can be addressed
3
Figure 1: A 3x3 systolic array after loading matrix B.
by a 4096-bit memory interface. Tensor Cores can perform 64 floating-point Fused-Multiply-Add
(FMA) operations in one cycle. FMA operates in half-precision (16-bit) and optionally stores a
32-bit output by using a sum accumulator. Such design can deliver a peak of 31.4 Tera Floating
Point Operations Per Second (FLOPS) with half precision on a Tesla V100 (640 Tensor Cores
spread over 80 SM). From the programming point of view, although TC basically performs one
matrix-multiply and accumulate operation on 4× 4 matrices, it is possible to multiply a 16× 16
matrix at programming level with one CUDA warp (32 threads). This requires primitives for data
loading and the synchronization of the result from/to registers through load and store units.
2.2 Systolic algorithms for matrix multiplication
The circuits which implement matrix multiplication in the Google TPU and in the NVIDIA TC
adopt a systolic algorithm for matrix multiplication. A systolic algorithm is an algorithm for a
systolic array, that is a network of processing elements (PEs) that rhythmically compute and pass
data through the system [18].
For the sake of completeness, we formalize the systolic algorithm implemented in the Google
TPU [13]. The implementation on NVIDIA TCs is slightly different but shares the same high level
structure, and we refer to [18] for a more complete overview of systolic algorithms. The systolic
algorithm is implemented on a 2-dimensional array of m PEs, and we denote with pi,j the PE at row
i and column j, for each 0 ≤ i, j < √m. Let A and B be the two √m×√m input matrices, and
let C = A ·B be the √m×√m output matrix; we denote with ai,j , bi,j , ci,j the entry in row i and
column j of A, B, C respectively; for notational simplicity, we let ai,j = 0 if i, j < 0 or i, j ≥
√
m.
The algorithm works as follows (see also Figure 1):
• In the first √m steps, matrix B is pushed within the m PEs so that pi,j contains bi,j .
• The algorithm then executes 3√m steps. In each step k, with 0 ≤ k < 2√m, each PE pi,j
receives: 1) an entry a of A from the left PE pi,j−1 or the input ak−i,i if j = 0; 2) a partial
sum c from the top PE pi−1,j , or we set c = 0 if i = 0. Then, each pi,j computes c = c+ a · bi,j
(recall that bi,j is in the local memory of pi,j . Finally pi,j forwards a to the right PE (pi,j+1, if
any) and c to the bottom PE (pi+1,j or it is output if i =
√
n− 1).
• We observe that each p√m−1,j outputs ci,j at the end of step k =
√
m+ i+ j.
We observe that the algorithm can be extended to compute C = A ·B where A is an n×√m
matrix and B is a
√
m×√m matrix, by just continuing pumping all rows within the system. This
feature is not available in the NVIDIA implementation since matrix B does not reside in the local
PE memories, but it is percolated within the array as matrix A.
4
3 The (m, `)-TCU model
We propose a computational model for tensor core units that captures the following three properties.
1: Matrix acceleration. The hardware circuits implement a parallel algorithm to multiply two
matrices of a fixed size, and the main cost is dominated by reading/writing the input and output
matrices. For a given hardware parameter m, we have that the multiplication of two matrices A
and B of size
√
m×√m are implemented in time O (m). With time, we mean the running time as
seen by the CPU clock and it should not be confused with the total number of operations executed
by the unit, which is always Θ
(
m3/2
)
. Indeed, no existing tensor unit implements fast matrix
multiplication algorithms, as for instance Strassen. The matrix multiplication operation is called
by an instruction specifying the address (in memory) of the two input matrices and of the output
matrix where the result will be stored; data will be loaded/stored by the tensor unit.
2: Latency cost. A call to the tensor unit has a latency cost. As the state of the art tensor units
use systolic algorithms, the first output entry is computed after Ω (
√
m) time. There are also initial
costs associated with activation, which can significantly increase when the unit is not connected to
the CPU by the internal system bus or is shared with other CPUs. We thus assume that the cost of
the multiplication of two matrices of size
√
m×√m is O (m+ `), where ` > 0 is the latency cost.
3: Asymmetric behavior. As tensor units are designed for improving training and inference
in deep networks, the two matrices in the multiplication A× B are managed differently. Matrix
B represents the model (i.e., the weights of the deep neural network), while the rows of matrix A
represent the input vectors to be evaluated. As the same model can be applied to k vectors, with
n >>
√
m, it is possible to first load the weights in B and then to stream the n rows of A into the
tensor unit (possibly in chunks of
√
m rows), reducing thus latency costs. Thus, we assume in our
model that two matrices of size n×√m and √m×√m are multiplied in time O (n√m+ `), where
the number n of rows is defined by the algorithm and n ≥ √m.
More formally, we define the Tensor Computing Unit (TCU) model as follows. The (m, `)-TCU
model is a standard RAM model where the CPU contains a circuit, named tensor unit, for performing
a matrix multiplication A×B of size n×√m and √m×√m in time O (n√m+ `), where m ≥ 1
and ` ≥ 0 are two model parameters and n ≥ √m is a value (possibly input dependent) specified
by the algorithm. The matrix operation is initialized by a (constant size) instruction containing
the addresses in memory of the two input matrices A and B, of the output matrix C, and the
row number n of A. The running time (or simply time) of a TCU algorithm is given by the total
cost of all operations performed by the CPU, including all calls to the tensor unit. We assume no
concurrency between tensor unit, memory and CPU, and hence at most one component is active at
any time. Each memory (and TPU) word consists of κ bits (in general, we denote κ = Ω (log n)
where n is the input size, that is enough for storing the input size in one word.
3.1 Discussion on the model
In this preliminary work, our goal is to understand how to exploit circuits of fixed size for matrix
multiplication. We then do not include in the model some characteristics of existing hardware
accelerators, like limited numerical precision and parallel tensor units. In particular, the modeling
of only a single tensor unit can be seen as a major weakness of our model since existing boards
contain a large number of TPUs (e.g., more than 500 TC in the Nvidia Titan RTX). However, we
5
believe that the first step to exploit tensor accelerators is to investigate which problems can benefit
of matrix multiplication circuits; we have then opted for a simple model with only a TCU. Moreover,
existing hardware accelerators use different parallel architectures and interconnection networks,
while they agree on matrix multiplication as main primitive. We now make some considerations on
how Google TPUs and NVIDIA TCs fit our model.
In the Google TPU (in the version described in [13]), the right matrix B has size 256×256 words
(i.e., m = 65536). The left matrix A is stored in the local unified buffer of 96k × 256 words; thus,
TPUs can compute the product between two matrices of size 96k×256 and 256× 256 in one (tensor)
operation. The number of rows of the left matrix in the TCU model is a user defined parameter
(potentially a function of the input size); on the other hand, the number of rows of the left matrix in
the TPU is user defined but it is upper bounded by a hardware-dependent value (i.e., 96K). Being
this bound quite large, a TPU better exploits a tall left matrix than a short one, as in our TCU
model. The systolic array works in low precision with 8 bits per word (κ = 8). The bandwidth
between CPU and TPU was limited in the first version (16GB/s), but it is significantly higher
in more recent versions (up to 600 GB/s). Although TPU has a quick response time, the overall
latency is high because the right hand matrix has to be suitably encoded via a TensorFlow function
before loading it within the TPU: in fact, the TPU programming model is strongly integrated with
TensorFlow, and it does not allow to use bare matrices as inputs. The high latency cost might
mitigate the fact that our model does not capture limited bandwidth.
The programming model of the NVIDIA Volta architecture allows one to multiply matrices of
size 16 × 16, although the basic hardware unit works on 4 × 4 matrices; we thus have m = 256.
Memory words are of κ = 16 bits. TCs exhibit high bandwidth and low latency, as data are provided
by a high bandwidth memory shared with the GPU processing units. Matrices A and B can be
loaded within TCs without a special encoding as in TPUs, since the NVIDIA Volta natively provides
support for matrix multiplication. Finally we observe that, as TCs are within a GPU, any algorithm
for TCs has also to take into account GPU computational bottlenecks (see e.g. [15, 1]).
4 Algorithms
4.1 Matrix multiplication
Dense matrix multiplication
A Strassen-like algorithm for matrix multiplication is defined in [4] as a recursive algorithm that
utilizes as base case an algorithm A for multiplying two √n0 × √n0 matrices using p0 element
multiplications and O (n0) other operations (i.e., additions and subtractions); we assume n0 = O (p0).
Given two
√
n×√n matrices with n > n0, a Strassen-like algorithm envisions the two
√
n×√n
matrices as two matrices of size
√
n0×√n0 where each entry is a submatrix of size
√
n/n0×
√
n/n0:
then, the algorithm recursively computes p0 matrix multiplications on the submatrices (i.e., the p0
element multiplications in A) and then performs O (n) other operations. For given parameters p0 and
n0, the running time of the algorithm is T (n) = O (n
ω0), where1 ω0 = logn0 p0. By setting n0 = 4
and p0 = 8, we get the standard matrix multiplication algorithm (ω0 = 3/2), while with n0 = 4
and p0 = 7 we get the Strassen algorithm (ω0 = log4 7 ∼ 1.403). Any fast matrix multiplication
algorithm can be converted into a Strassen-like algorithm [23].
The TCU model can be exploited in Strassen-like algorithms by ending the recursion as soon as
a subproblem fits the tensor unit: when n ≤ m, the two input matrices are loaded in the tensor
1We observe that ω0 corresponds to ω/2, where ω is the traditional symbol used for denoting the exponent in fast
matrix multiplication algorithms.
6
unit and the multiplication is computed in O (m) time. We assume m ≥ n0, otherwise the tensor
unit would not be used.
Theorem 1. Given a Strassen-like algorithm with parameters n0 and p0, then there exists a TCU
algorithm that multiplies two
√
n×√n matrices on an (m, `)-TCU model, with m ≥ n0, in time
T (n) = O
(( n
m
)ω0
(m+ `)
)
.
Proof. The running time is given by the following simple recursion which assumes n ≥ m:
T (n) =
{
O
(
n3/2
m1/2
+ nm`
)
if m ≤ n ≤ mn0;
p0T (n/n0) +O (n) otherwise .
By solving the recurrence, we get
T (n) = O
(
n0
p0
( n
m
)ω0
(m
√
n0 + `)
)
,
where ω0 = logn0 p0. Since n0 and p0 are independent of n, we get the claimed result.
The standard recursive matrix multiplication algorithm gives O
(
n3/2/m1/2 + (n/m)3/2`)
)
time.
With the Strassen algorithm, we get O
(
n1.4037/m0.4037 + (n/m)1.4037`
)
time
We now show how to decrease the latency cost, i.e., (n/m)3/2`, in the TCU algorithm based
on the standard algorithm. The idea is to keep as much as possible the right matrix B within the
tensor unit by using a tall left matrix A. We split the left matrix A into
√
n/m blocks Ai of size√
n×√m (i.e., vertical strips of width √m) and the right matrix B into square blocks Bi,j of size√
m ×√m, with 0 ≤ i, j < √n/m. Then, we compute Ci,j = Ai · Bi,j for each 0 ≤ i, j < √n/m
using the tensor unit in time O (
√
nm+ `). The final matrix C follows by computing the
√
n×√m
matrices Ci =
∑√n/m−1
j=0 Ci,j .
Theorem 2. There exists an algorithm that multiplies two
√
n×√n matrices in the (m, `)-TCU
model in time
T (n) = Θ
(
n3/2
m1/2
+
n
m
`
)
.
The algorithm is optimal when only semiring operations are allowed.
Proof. Each multiplication Ci,j = Ai · Bi,j requires O (n
√
m+ `) time. Since there are n/m such
multiplications, the upper bound follows. The cost of the final summation is negligible.
When using only semiring operations, any algorithm must compute n3/2 elementary products.
Since each call to a tensor computes m3/2 elementary products in Θ (m) time using a systolic
algorithm, we need Ω
(
n3/2/m1/2
)
time. Furthermore, since all entries of B must be loaded in the
tensor unit at least once and we cannot load more than m entries in B per tensor operation, the
algorithm has to load at least n/m distinct right matrices in the tensor unit; then a Ω (`n/m) lower
bound on the time also follows.
From the previous Theorem 2, we get the following corollary for rectangular matrices (a similar
result holds also when using the algorithm for fast matrix multiplication in Theorem 1).
7
1. for k ← 1 to √n− 2 do
2. for i← k + 1 to √n− 1 do
3. for j ← k + 1 to √n do
4. c[i, j]← c[i, j]+
(
− c[i,k]
c[k,k]
)
×c[k, j]
Figure 2: Forward phase of Gaussian elimination.
Corollary 1. A
√
n× r matrix can be multiplied by an r ×√n matrix in the (m, `)-TCU model in
time
T (n, r) = Θ
(
rn
m1/2
+
r
√
n
m
`
)
,
assuming n, r2 ≥ m.
Proof. It suffices to decompose the problem into products of size s× s where s = min{√n, r} and
then apply Theorem 2.
Sparse matrix multiplication
A TCU algorithm to multiply two sparse matrices follows from the work [12] that uses as a black
box a fast matrix multiplication algorithm for multiplying two
√
n × √n matrices in O (nω/2)
time. Let I be the number of non-zero entries in the input matrices A and B, and let Z be the
number of non-zero entries in the output C = A ·B. We consider here the case where the output is
balanced, that is there are Θ (Z/
√
n) non-zero entries per row or column in C; the more general
case where non-zero entries are not balanced is also studied in [12] and can be adapted to TCU
with a similar argument. The algorithm in [12] computes the output in time O˜
(√
nZ(ω−1)/2 + I
)
with high probability. The idea is to compress the rows of A and the column of B from
√
n to
√
Z
using a hash function or another compression algorithm able to build a re-ordering of the matrix A.
Then the algorithm computes a dense matrix product between a
√
Z ×√n matrix and a √n×√Z
using the fast matrix multiplication algorithm. By replacing the fast matrix multiplication with the
result of Theorem 1, we get the following.
Theorem 3. Let A and B be two sparse input matrices of size
√
n×√n with at most I non-zero
entries, and assume that C = A ·B has at most Z non-zero entries evenly balanced among rows and
columns. Then there exists an algorithm for the (m, `)-TCU model requiring time:
T (n,Z, I) = O
(√
n
Z
(
Z
m
)ω0
(m+ `) + I
)
,
when Z ≥ m and where ω0 = logn0 p0 is the exponent given by a Strassen-like algorithm.
Proof. The cost is dominated by the matrix product between a
√
Z ×√n matrix and a √n×√Z
using the fast matrix multiplication algorithm in Theorem 1,
4.2 Gaussian Elimination without Pivoting
Gaussian elimination without pivoting is used in the solution of systems of linear equations and LU
decomposition of symmetric positive-definite or diagonally dominant real matrices [8]. We represent
8
Figure 3: Blocked version of Gaussian elimination. It shows which functions update which blocks in
iteration k = 3 of the outermost for loop of GE-forward.
a system of r − 1 equations in r − 1 unknowns (x1, x2, . . . , xr−1) using an r × r matrix c, where the
i’th (1 ≤ i < r) row represents the equation ai,1x1 + ai,2x2 + . . .+ ai,r−1xr−1 = bi:
c =

a1,1 a1,2 . . . a1,
√
n−1 b1
a2,1 a2,2 . . . a2,
√
n−1 b2
...
...
. . .
...
...
a√n−1,1 a√n−1,2 . . . a√n−1,√n−1 b√n−1
0 0 . . . 0 0

The method proceeds in two phases. In the first phase, an upper triangular matrix is constructed
from c by successive elimination of variables from the equations. This phase requires Θ
(
r3
)
time
(see code in Figure 2). In the second phase, the values of the unknowns are determined from this
matrix by back substitution. It is straightforward to implement this second phase in Θ
(
r2
)
time, so
we will concentrate on the first phase.
Our TCU algorithm for the forward phase of Gaussian elimination without pivoting is shown
in Figure 4. The algorithm is invoked as GE-forward( c ), where c is the
√
n × √n matrix
representing a system of
√
n− 1 equations with √n− 1 unknowns. Figure 3 shows which function
∈ {A,B,C,D} update which block in iteration k = 3 of the outermost for loop of GE-forward.
In this algorithm only the calls to function D (in line 10) which multiplies
√
m×√m matrices are
executed on the TCU. In each iteration of the loop in line 8, X ′j is loaded into the TCU as the
weight matrix, and the
(√
n/m− k
)√
m =
√
n − k√m rows of the √n/m − k submatrices Xik
inside the loop in line 9 are streamed through the TCU.
Theorem 4. The forward phase of Gaussian elimination without pivoting applied on a system of√
n− 1 equations with √n− 1 unknowns can be performed in the (m, `)-TCU model in time
T (n) = Θ
(
n3/2
m1/2
+
n
m
`+ n
√
m
)
.
This complexity reduces to the optimal cost of multiplying two dense
√
n×√n matrices (see Theorem
2) when
√
n ≥ m.
Proof. The outermost loop of GE-forward in line 2 is executed
√
n/m times. In each itera-
tion the cost of executing line 3 is Θ
(
m3/2
)
, and that of executing the loops in lines 4 and 6
9
GE-forward( X )
(X points to the
√
n×√n input matrix c. We assume
that m divides n, where
√
m × √m is the size of the
matrix multiplication unit of the TCU.)
1. Split X into
√
n
m
×
√
n
m
square submatrices of
size
√
m×√m each. The submatrix of X at the
i-th position from the top and the j-th position
from the left is denoted by Xij . X
′ is a
√
m×√
n matrix split into
√
m × √m submatrices,
where the submatrix at j-th position from the
left is denoted by X′j .
2. for k ← 1 to √n/m do
3. A( Xkk )
4. for j ← k + 1 to √n/m do
5. B( Xkj , Xkk, X
′
j )
6. for i← k + 1 to √n/m do
7. C( Xik, Xkk )
8. for j ← k + 1 to √n/m do
9. for i← k + 1 to √n/m do
10. D( Xij , Xik, X
′
j )
D( X, Y, Z )
(X, Y and Z point to disjoint
√
m×√mmatrices, where√
m×√m is the size of the matrix multiplication unit
of the TCU.)
1. for k ← 1 to √m do
2. for i← 1 to √m do
3. for j ← 1 to √m do
4. X[i, j]← X[i, j] + Y [i, k]× Z[k, j]
A( X )
(X points to a
√
m×√m matrix, where √m×√m is the size
of the matrix multiplication unit of the TCU.)
1. for k ← 1 to √m− 1 do
2. for i← k + 1 to √m do
3. for j ← k + 1 to √m do
4. X[i, j] ← X[i, j] −
(X[i, k]×X[k, j]) /X[k, k]
B( X, Y, X′ )
(X, Y and X′ point to disjoint
√
m × √m matrices, where√
m×√m is the size of the matrix multiplication unit of the
TCU.)
1. for k ← 1 to √m− 1 do
2. for i← k + 1 to √m do
3. for j ← 1 to √m do
4. X[i, j] ← X[i, j] −
(Y [i, k]×X[k, j]) /Y [k, k]
5. for i← 1 to √m do
6. for j ← 1 to √m do
7. X′[i, j]← −X[i, j]/Y [i, i]
C( X, Y )
(X and Y point to disjoint
√
m×√m matrices, where √m×√
m is the size of the matrix multiplication unit of the TCU.)
1. for k ← 1 to √m do
2. for i← 1 to √m do
3. for j ← k + 1 to √m do
4. X[i, j] ← X[i, j] −
(X[i, k]× Y [k, j]) /Y [k, k]
Figure 4: TCU algorithm for Gaussian elimination without pivoting which is called as GE-
forward( c ), where c is the
√
n×√n matrix representing a system of √n− 1 equations with √n− 1
unknowns.
is Θ
(
m3/2
(√
n/m− k
))
. Then in each of the
√
n/m − k iterations of the loop in line 8 we
load X ′j as the weight matrix into the TCU and then lines 9–10 are executed by streaming the(√
n/m− k
)√
m =
√
n− k√m rows of Xk+1,k, Xk+2,k . . . X√n/m,k through the TCU. Total cost
of lines 8–10 is thus Θ
((√
n/m− k
)
((
√
n− k√m)√m+ l)
)
. The total cost over all iterations of
the loop in line 2 is then
Θ

√
n/m∑
k=1
(
m3/2 +
(√
n/m− k
)(
m3/2 +
(√
n− k√m)√m+ l))

10
from which we get Θ
(
n3/2
m1/2
+ nm`+ n
√
m
)
. When
√
n ≥ m ⇒ n3/2
m1/2
≥ n√m, the cost reduces to
Θ
(
n3/2
m1/2
+ nm`
)
which matches the optimal cost of multiplying two dense
√
n×√n matrices (see
Theorem 2).
4.3 Graph Transitive Closure
1. for k ← 1 to n do
2. for i← 1 to n do
3. for j ← 1 to n do
4. d[i, j]← d[i, j] ∨ (d[i, k] ∧ d[k, j])
Figure 5: Iterative algorithm for computing the transitive closure of an n vertex graph given its
n× n djacency matrix d with d[i, j] = 1 if there is an edge from vertex i to vertex j and d[i, j] = 0
otherwise.
Figure 6: Blocked version of transitive closure. It shows which functions update which blocks in
iteration k = 3 of the outermost for loop of Transitive-Closure.
For an n-vertex directed graph G, its transitive closure is given by an n× n matrix c[1..n, 1..n],
where for all i, j ∈ [1, n], c[i, j] = 1 provided vertex j is reachable from vertex i and c[i, j] = 0
otherwise. Figure 5 shows how to compute c in Θ
(
n3
)
time by updating the 0/1 adjacency matrix
d[1..n, 1..n] of G in place. The algorithm is similar to the standard iterative matrix multiplication
algorithm except that bitwise-AND (∧) and bitwise-OR (∨) replace multiplication (×) and addition
(+), respectively.
Figure 7 shows the blocked version of the algorithm given in Figure 5. Figure 6 shows which
function ∈ {A,B,C,D} updates which block in iteration k = 3 of the outermost for loop of
Transitive-Closure. However, we observe that function D which updates block X using data
from blocks Y and Z that are disjoint from X can be implemented to use “×” and “+” instead
of “∧” and “∨”, respectively, provided we set X[i, j]← min (X[i, j], 1) for all i, j after it completes
updating X. Function D is invoked in line 11 of Transitive-Closure almost n
2
m times. We
execute lines 1– 4 of function D (which represent standard multiplication of two
√
m×√m matrices)
on a TCU. In each iteration of the loop in line 8, Xkj is loaded into the TCU as the weight matrix,
and the (n/
√
m− 1)√m = √n−√m rows of the n/√m− 1 submatrices Xik inside the loop in line
9 are streamed through the TCU.
11
Transitive-Closure( X )
(X points to the n× n input 0/1 matrix d. We assume
that m divides n, where
√
m × √m is the size of the
matrix multiplication unit of the TCU.)
1. Split X into n√
m
× n√
m
square submatrices of
size
√
m×√m each. The submatrix of X at the
i-th position from the top and the j-th position
from the left is denoted by Xij .
2. for k ← 1 to n√
m
do
3. A( Xkk )
4. for j ← 1 to n√
m
do
5. if j 6= k then B( Xkj , Xkk )
6. for i← 1 to n√
m
do
7. if i 6= k then C( Xik, Xkk )
8. for j ← 1 to n√
m
do
9. for i← 1 to n√
m
do
10. if i 6= k and j 6= k then
11. D( Xij , Xik, Xkj )
A( X )
(X points to a
√
m×√m 0/1 matrix, where √m×√m
is the size of the TCU matrix multiplication unit.)
1. for k ← 1 to √m do
2. for i← 1 to √m do
3. for j ← 1 to √m do
4. X[i, j]← X[i, j] ∨ (X[i, k] ∧X[k, j])
B( X, Y )
(X, Y and X′ point to disjoint
√
m×√m 0/1 matrices, where√
m×√m is the size of the TCU matrix multiplication unit.)
1. for k ← 1 to √m do
2. for i← 1 to √m do
3. for j ← 1 to √m do
4. X[i, j]← X[i, j] ∨ (Y [i, k] ∧X[k, j])
C( X, Y )
(X and Y point to disjoint
√
m × √m 0/1 matrices, where√
m×√m is the size of the TCU matrix multiplication unit.)
1. for k ← 1 to √m do
2. for i← 1 to √m do
3. for j ← 1 to √m do
4. X[i, j]← X[i, j] ∨ (X[i, k] ∧ Y [k, j])
D( X, Y, Z )
(X, Y and Z point to disjoint
√
m×√m 0/1 matrices, where√
m×√m is the size of the TCU matrix multiplication unit.)
1. for k ← 1 to √m do
2. for i← 1 to √m do
3. for j ← 1 to √m do
4. X[i, j]← X[i, j] + (Y [i, k]× Z[k, j])
5. for i← 1 to √m do
6. for j ← 1 to √m do
7. if X[i, j] > 1 then X[i, j]← 1
Figure 7: TCU algorithm for computing transitive closure of an n-vertex graph which is called as
Transitive-Closure( d ), where d is the n× n adjacency matrix of the graph with d[i, j] = 1 if
vertices i and j are adjacent and d[i, j] = 0 otherwise.
Theorem 5. The transitive closure of an n-vertex directed graph can be computed in the (m, `)-TCU
model in time
T (n) = Θ
(
n3√
m
+
n2
m
`+ n2
√
m
)
.
This complexity reduces to the optimal cost of multiplying two dense n× n matrices (see Theorem 2)
when n ≥ m.
Proof. The proof is similar to the proof of Theorem 4 and is obtained by analyzing the steps of
Transitive-Closure shown in Figure 7.
4.4 All Pairs Shortest Distances (APSD)
We discuss TCU implementation of Seidel’s algorithm [26] for computing APSD in an unweighted
undirected graph G = (V,E), where n = |V | and vertices are numbered by unique integers from 1
12
to n.
Let A be the adjacency matrix of G. The adjacency matrix A(2) of the squared graph G(2) =
(V,E(2)) is obtained by squaring A and replacing all nonzero entries in the square matrix by 1.
Indeed, for any given pair of vertices u, v ∈ V , (u, v) ∈ E(2) (i.e., A(2)[u, v] = 1) provided there exists
a vertex w ∈ V such that (u,w), (w, v) ∈ E (i.e., A[u,w] = A[w, v] = 1). Let δ(u, v) and δ(2)(u, v)
represent the shortest distance from u to v in G and G(2), respectively. Seidel shows that if all δ(2)
values are known one can correctly compute all δ(u, v) values from them. Let D(2) be the distance
matrix of G(2) and let C = D(2)A. Then Seidel shows that for any pair u, v ∈ V , δ(u, v) = 2δ(2)(u, v)
provided
∑
(w,v)∈E D
(2)[u,w] = C[u, v] ≥ deg(v)×D(2)[u, v], and δ(u, v) = 2δ(2)(u, v)− 1 otherwise,
where deg(v) is the number of neighbors of v in G. Thus the distance matrix D of G can be
computed from D(2) by computing C = D(2)A. The D(2) matrix is computed recursively. The base
case is reached when we encounter G(h) where h = dlog2(n)e. It’s adjacency matrix A(h) has all 1’s,
and it’s distance matrix is simply D(h) = A(h) − In. Clearly, there are h levels of recursion and in
each level we compute two products of two n× n matrices. Hence, using Theorem 1 we obtain the
following.
Theorem 6. All pairs shortest distances of an n-vertex unweighted undirected graph can be computed
in the (m, `)-TCU model in time
T (n) = O
((
n2/m
)ω0 (m+ `) log n) .
Proof. The claim follows since we have O (log n) matrix multiplications of size n × n, each one
requiring O
((
n2/m
)ω0 (m+ `)) for Theorem 1.
4.5 Discrete Fourier Transform
The Discrete Fourier Transform y of an n-dimensional (column) vector x can be defined as the
matrix-vector product y = xT ·W , where W is the Fourier matrix (or DFT matrix) and T denotes
the transpose of a matrix/vector. The Fourier matrix W is a symmetric n× n matrix where the
entry at row r and column c is defined as: Wr,c = e
−(2pii/n)rc.
The Cooley-Tukey algorithm is an efficient and recursive algorithm for computing the DFT of a
vector. The algorithm arranges x as an n1 × n2 matrix X (in row-major order) where n = n1 · n2;
each column X∗,c is replaced with its DFT and then each entry Xr,c is multiplied by the twiddle
factor wrcn ; finally, each row Xr,∗ is replaced by its DFT and the DFT of x is given by reading the
final matrix X in column-major order.
For simplicity, we assume that the TCU model can perform operations (e.g., addition, products)
on complex numbers; this assumption can be easily removed with a constant slow down in the
running time: for instance, the multiplication between
√
m×√m complex matrices can be computed
with four matrix multiplications and two sums of real values.
To compute the DFT of x using a (m, `)-TCU, we use the Cooley-Tukey algorithm where we
set n1 =
√
m and n2 = n/
√
m (we assume all values to be integers). Then, we use the tensor unit
for computing the n2 DFTs of size n1 =
√
m by computing XT ·W√m. Then, we multiply each
element in X by its twiddle factor and transpose X. Finally, we compute the n1 DFTs of size n2: if
n2 >
√
m, the DFTs are recursively computed; otherwise, if n2 ≤
√
m, the n1 DFTs are computed
with the multiplication XT ·Wn2 by using the tensor unit.
Theorem 7. The DFT of a vector with n entries can be computed in a (m, `)-TCU in time
T (n) = O ((n+ `) logm n) .
13
Proof. In each recursive level, we load matrix B = W√m at the beginning and then compute the
n2 DFTs of n1 entries by multiplying an n/
√
m×√m matrix with W√m in time O (n+ `). The
remaining operations (i.e., matrix transposition and multiplication by twiddle factors) cost O (n)
time. The running time is thus given by the next recurrence, from which follows the statement.
T (n) =
{ √
mT (n/
√
m) +O (n+ `) if n > m
O (m+ `) if n ≤ m
We observe that we use as base case n ≤ m and not n ≤ √m: indeed, when n ≤ m we use the
tensor core for computing the n2 DFTs of size n1 =
√
m, but also the n1 DFTs of size n2 ≤
√
m.
This analysis gives a tighter upper bound than using n ≤ √m as base case.
We observe that the above algorithm generalizes the approach used in [28] on an NVIDIA
Verdi architecture. The paper decomposes the vector using n1 = 4 and n2 = n/4 and then solves
subproblems of size 4 using a tensor core, since a TC can multiply 4× 4 matrices.
4.6 Stencil computations
Stencil computations are iterative kernels over a d-dimensional array, widely used in scientific
computing. Given a d-dimensional matrix A, a stencil computation performs a sequence of sweeps
over the input: in a sweep, each cell is updated with a function f(·) of the values of its neighboring
cells at previous sweeps. An example of stencil computation is the discretization of the 2D heat
equation, where each entry at time t is updated as follows:
At[x, y] = At−1[x, y]+
+
α∆t
∆x2
(At−1[x− 1, y] +At−1[x+ 1, y]− 2At−1[x, y])
+
α∆t
∆y2
(At−1[x, y − 1] +At−1[x, y + 1]− 2At−1[x, y])
where α,∆t,∆x2,∆y2 are suitable constant values given by the heat diffusion equations and by the
discretization step.
For the sake of simplicity, we assume d = 2 and that each update depends only on the values
of the cell and of its eight (vertical/horizontal/diagonal) neighbors at previous sweep. However,
all techniques presented in this section extend to any d = O (1) and to any update function that
depends on a constant number of neighbors.
Given n, k ≥ 1, an (n, k)-stencil computation, over an input √n×√n matrix A is the matrix
Ak obtained by the following iterative process: let A0 = A and 1 ≤ t ≤ k; matrix At is defined
by computing, for each 0 ≤ i, j < √n, At[i, j] = f(i, j, At−1) where f is a suitable function of cells
At−1[i+ α, j + β] with α, β ∈ {−1, 0, 1}. We say that a stencil computation is linear if f is a linear,
that is
At[i, j] =
∑
α,β∈{−1,0,1}
wα,βAt−1[i+ α, j + β].
where wα,β are suitable real values. The above stencil computation for approximating heat equations
is linear. We assume k to be even and that all values are integers.
By unrolling the update function of a linear (n, k)-stencil computation, each entry Ak[i, j] can
be represented as a linear combination of O
(
k2
)
entries of A, specifically all entries (i′, j′) in A
14
where |i− i′| ≤ k and |j − j′| ≤ k. That is, there exists a (2k + 1)× (2k + 1) matrix W such that
At[i, j] =
∑
−k≤α,β≤k
W [k + α, k + β]A[i+ α, j + β].
We now show that a linear (n, k)-stencil on a matrix A reduces to Θ
(
n/k2
)
convolutions of
size O
(
k2
)
, which are then computed with the TCU algorithm for DFT in Theorem 7. Let us
assume that matrix A is split into submatrices Ar,c of size k × k, with 0 ≤ r, c <
√
n/k; similarly,
let Ak,r,c denote the k × k submatrices of Ak. For each Ar,c, we define the following matrix A′r,c of
size 3k × 3k:
A′r,c =
 Ar−1,c−1 Ar−1,c Ar−1,c+1Ar,c−1 Ar,c Ar,c+1
Ar+1,c−1 Ar+1,c Ar+1,c+1
 .
where we assume that a matrix Ai,j is a zero matrix when i and j are not in the range [0,
√
n/k).
We then compute the circular discrete convolution A∗r,c = A′r,c ~W ′, where W ′ is a 3k × 3k matrix
obtained by flipping W and by adding k/2 (resp., k/2− 1) rows and columns of zeros on the left
and top (resp., right and bottom) sides of W .2 Finally, we set Ak,r,c to be the k× k matrix obtained
from A∗r,c by selecting the i-row and j-th column for all k ≤ i, j < 2k. By repeating the following
procedure for each submatrix Ar,c, we get the output matrix Ak.
Each convolution can be efficiently computed by exploiting the convolution theorem and the
DFT algorithm of Theorem 7. We indeed recall that a 2-dimensional DFT is given by computing a
1-dimensional DFT for each row and for each column. If W is given, we have the following:
Lemma 1. Given a linear (n, k)-stencil computation and its weight matrix W , then the stencil can
be computed in a (m, `)-TCU in time T (n, t) = O ((n+ `) logm k) .
Proof. The correctness follows by observing that, by definition of convolution, we have for each
k ≤ i, j < 2k:
A∗r,c[i, j] =
∑
α,β∈
[− 3k
2
, 3k
2
)
A′r,c[i+ α, j + β]W
′
[
3k
2
− α, 3k
2
− β
]
=
∑
α,β∈
[−k,k]
A′r,c[i+ α, j + β]W [k + α, k + β]
= Ak[i, j].
Each convolution has size Θ
(
k2
)
and is computed with Θ (k) DFT of Θ (k) elements and Θ
(
k2
)
element-wise products. The time per convolution is then O
(
(k2 + `k) logm k
)
time, while the time of
the entire algorithm is O ((n+ `n/k) logm k). However, the latency cost can be reduced by observing
that all the Θ (n/k) DFTs of Θ (k) elements can be computed concurrently; then, for each of the
O (logm k) recursive levels of the DFT algorithm, we use a 2 tall left matrices of size Θ (n/k)×Θ (k)
for computing the O (n/k) DFTs (we need two matrices because we first compute the DFTs of the
rows, and then on the columns). The claimed result follows.
2With a slight abuse of notation, given two n × n matrices A and B with n even, we define (A ~ B)[i, j] =∑
α,β∈[−n/2,n/2)A[(i+α) mod n, (j+β) mod n]W [n/2−α, n/2−β]. In the paper, we omit the mod operation from
the notation.
15
The weight matrix can be trivially computed in O
(
k3
)
time by recursively unrolling function
f . However, as soon as k ≥ (n logm k)1/3, the cost for computing W dominates the cost of the
stencil algorithm. A more efficient solution follows by representing W as the powering of a bivariate
polynomial and then using the DFT to compute it, as shown in the following lemma.
Lemma 2. The weight matrix of a linear (n, k)-stencil computation can be computed in a (m, `)-TCU
in time O
(
k2 logm k + ` log k
)
.
Proof. Let fu be function f after unrolling it u-th times, that is Ak[i, j] = fu(i, j, Ak−u). We
assign to each fu a bivariate polynomial Fu(x, y) =
∑
−k≤α,β≤k wu[α, β]x
αyβ , where wu[α, β] are the
coefficients of Ak−u[i+α, j+β] in fu. By setting F0(x, y) = 1, it follows by induction that Fu(x, y) =
Fu−1(x, y)P (x, y) where P (x, y) =
∑
α,β∈{−1,0,1}wα,βx
αyβ. Entry W [i, j] is then given by the
coefficient of the xi−kyj−k in Fk = F0(x, y)∗P (x, y)k. Since P (x, y)k = P (x, y)dk/2e)2P (x, y)k mod 2,
we compute recursively the coefficient in Θ (log k) recursive calls, each one performing a convolution
using the TCU algorithm for DFT of geometrically decreasing size. We then get the main statement.
By the previous two results, we then have the main result:
Theorem 8. Given a linear (n, k)-stencil computation with k ≤ n, then the stencil can be computed
in a (m, `)-TCU in time T (n, t) = O (n logm k + ` log k) .
Proof. The claim follows by summing the time complexities of Lemmas 2 and 1.
4.7 Integer multiplication
We now study how to multiply two long integers by exploiting a (m, `)-TCU. The input is given
by two integers a and b of n bits each (without loss of generality, we assume both integers to be
positive and n > m), and the output is the binary representation of c = a ∗ b, of size 2n− 1. For
this problem, we introduce in the design a third parameter κ, which is the bit length of a memory
word in the TCU model. We assume that κ = Ω (log n), that is there are enough bits in a word to
store the input/output size. It is easy to see that the tensor unit can multiply matrices of (positive)
integers of κ′ = κ/4 bits without overflow: the largest integer in the output matrix using κ′ bits is
22κ
′√
m which requires 2κ′ + log
√
m < κ (if n >> m, then κ′ = κ/2− 1 suffices).
We initially show how to speed up the long integer multiplication algorithm [16], also known
as the schoolbook algorithm, by exploiting the tensor unit. Then, we will use this algorithm to
improve the recursive Karatsuba algorithm [14].
Let A(x) =
∑n′−1
i=0 Aix
i be a polynomial where n′ = n/κ′ and Ai = (a(i+1)κ′−1 . . . aiκ′)2 is the
integer given by the ith segment of κ′ bits of a. Let B(x) be defined similarly for b. We have
that a = A(2κ
′
) and b = B(2κ
′
). We define C(x) = A(x) · B(x) and we observe that c is given
by evaluating C(2κ
′
). Note that A(X) and B(X) have degree n′ − 1, while c has degree at most
(2n − 1)/κ′ ≤ 2n′ − 1. The coefficients of C(x) can be computed with the matrix multiplication
C = A ·B where: 1) B is the column vector with the n′ coefficients of B(X); 2) A is a (2n′− 1)×n′
matrix where Ai,j = An′−i+j−1 and we assume that Ah = 0 if h < 0 or h ≥ n′.
The product C = A · B cannot exploit TCU since B is a vector. To fully exploit an (m, `)-
TCU, we show how to calculate C coefficients via the multiplication C ′ = A′ · B′ where A is a
(n′ +
√
m− 1)×√m matrix and B is a √m× n′/√m matrix.
• Matrix B′ follows by considering vector B as the column major representation of a√m×n′/√m
matrix, that is B′i,j = Bn′−i−j√m−1.
16
• Matrix A′ is given by considering all segments of length √m in the sequence 0√m−1, A0, A1, . . .
An′−1, 0√m−1, where 0√m−1 denotes a sequence of
√
m − 1 zeros. More formally, the ith
row A′i,∗ is [An′−i−1, An′−i−2, . . . An′−i−√m], where we assume again that Ah = 0 if h < 0 or
h ≥ n′.
Then, we compute C ′ = A′ ·B′ with the algorithm for dense matrix multiplication of Theorem 2
(or equivalently the algorithms of Theorem 1): We decompose B′ into into n′/m submatrices of size√
m×√m and then compute n′/m products of a (n′ +√m− 1)×√m matrix with a √m×√m
matrix. Finally, the coefficient of the xh indeterminate in C(x), for each 0 ≤ h < 2n′ − 1, follows by
summing all entries in C ′i,j such that h = 2(n
′ − 1)− i− j√m. Finally we compute c = C(2κ′).
Theorem 9. Two integers of n bits can be multiplied in a (m, `)-TCU with κ bits operations in
time
T (n) = O
(
n2
κ2
√
m
+
n
κm
`
)
.
Proof. Let Ch be the coefficient of C(x) of x
h indeterminate. We have:
Ch =
∑
i,j|i+j=h
aibj =
dh/√me∑
p=0
h−p√m∑
i=h−(p+1)√m+1
aibh−i
where as usual ai = 0 and bi = 0 if i < 0 or i ≥ n′. By definition of matrices A′ and B′, we can
rewrite the previous equation as:
Ch =
dh/√me∑
p=0
A′n′−h+p√m−1,∗B
′
∗,(n′−1)/√m−p
=
dh/√me∑
p=0
Cn′−h+p√m−1,(n′−1)/√m−p
We observe that the last sum is including all entries in Ci,j where h = 2(n
′ − 1) − i − j√m, as
required in the algorithm. The algorithm then correctly computes all Ch coefficients and hence
c = a · b.
We now consider the running time. The cost of the matrix multiplication C ′ = A′ · B′ is
O
(
n′
m (n
′√m+ `)
)
. The cost of computing the Ch coefficients and the final evaluation is upper
bounded by O (n′/
√
m). The claim follows.
The Karatsuba algorithm is a well-known algorithm that computes c = a · b by recursively
computing three integer multiplications of size n/2 and then combining the solution in O (n/κ) time.
If we stop the recursion as soon as the input size is n ≤ k√m and solve the subproblem with the
algorithm of Theorem 9, we get the following result.
Theorem 10. Two integers of n bits can be multiplied in a (m, `)-TCU with κ bits operations in
time
T (n) = O
((
n
κ
√
m
log 3
)(√
m+
`√
m
))
.
Proof. The running time follows by the recurrence:
T (n) =
{
3T (n/2) +O (n/κ) if n > k
√
m
O (
√
m+ `/
√
m) if n ≤ k√m
17
4.8 Batch polynomial evaluation
We now show how to exploit the TCU for evaluating a given polynomial of A(x) =
∑n−1
i=0 aix
i of
degree n− 1 on p points pi, with 0 ≤ i < p. For simplicity we assume n to be a multiple of
√
m,
p ≥ √m, and that the polynomial can be evaluated without overflow on the memory word available
in the TCU model.
We initially compute for each pi the powers p
0
i , p
1
i , . . . p
√
m−1
i and p
√
m
i , p
2
√
m
i , . . . p
n−√m
i , that
is pji for each j ∈ {0, 1, . . . ,
√
m − 1} ∪ {k√m,∀k ∈ {1, . . . , n/√m − 1}}. We define the following
matrices:
• A matrix X of size p×√m where the ith row is Xi,∗ = [p0i , p1i , . . . , p
√
m−1
i ] for each 0 ≤ i < p.
• A matrix A of size √m×n/√m where Ai,j = ai+j√m for each 0 ≤ i <
√
m and 0 ≤ j < n/√m.
Stated differently, we consider the sequence a0, . . . , an−1 as the column major representation
of A.
We then compute the product C = X ·A by exploiting the tensor unit. As in the previous section,
we decompose A into
√
m×√m submatrices and then solve n/m multiplications. Then, for each
pi, the values A(pi) follows by the sum
∑n/√m−1
j=0 Ci,jp
j
√
m
i .
Theorem 11. A polynomial of degree n− 1 can be evaluated on p points on a (m, `)-TCU with κ
bits operations in time
T (n, p) = O
(
pn√
m
+ p
√
m+
n
m
`
)
.
Proof. The correctness follows by observing that:
n/
√
m−1∑
j=0
Ci,jx
j
√
m =
n/
√
m−1∑
j=0
√m−1∑
k=0
pki ak+j
√
m
 pj√mi
=
n/
√
m−1∑
j=0
√
m−1∑
k=0
p
k+j
√
m
i ak+j
√
m
=
n−1∑
h=0
ahp
h
i = A(pi).
The cost of the initial exponentiation and of the final evaluation is O (p(
√
m+ n/
√
m)). The cost
of computing C is given by invoking the tensor unit n/m times on two matrices of size p×√m and√
m×√m, that is O
(
pn√
m
+ nm`
)
.
5 Relation with the external memory model
In this section we highlight a relation between the external memory model and the TCU model. We
recall that the external memory model (also named I/O model and almost equivalent to the ideal
cache model) is a model capturing the memory hierarchy and it consists of an external memory of
potential unbounded size, of an internal memory of M ≥ 1 words, and a processor. The processor
can only perform operations with data in the internal memory, and moves (input/output) blocks of
B ≥ 1 words between the external memory and the internal memory. The I/O complexity of an
algorithm for the external memory model is simply the number of blocks moved between the two
memories. We refer to the excellent survey in [30] for a more exhaustive explanation.
18
The time of some of the previous TCU algorithms recall the I/O complexity of the respective
external memory algorithms. For instance, the cost of dense matrix multiplication with only semiring
operations (Theorem 2) is O
(
n3/2/
√
m
)
when ` = O (1), while the I/O complexity for the same
problem in the external memory model is O
(
n3/2/
√
M
)
when B = O (1) [30].
We observe that the product between two matrices of size
√
m×√m requires O (m) I/Os to load
and storing the input in an internal memory with M = 3m and B = O (1). Therefore any call to the
tensor unit in a TCU can be simulated in the external memory of size M = 3m with Θ (m) I/Os.
Leveraging on this claim, we show that a lower bound in the external memory model translates
into a lower bound in a weaker version of the TCU model. In the weak TCU model, the tensor
unit can only multiply matrices of size
√
m×√m (i.e., we cannot exploit tall left matrices). We
observe that any algorithm for the original TCU model can be simulated in the weak version with a
constant slowdown when ` = O (m): indeed, the multiplication between an n×√m matrix with a√
m×√m can be executed in the weak model by splitting the n×√m matrix into n/√m matrices
of size
√
m×√m and then performing n/√m matrix multiplications with total time O (n√m).
Theorem 12. Consider a computational problem P with a lower bound FP on the I/O complexity
in an external memory with memory size M = 3m + O(1) and block length B = 1. Then, any
algorithm for P in the weak TCU model requires Ω (FP) time.
Proof. Consider an algorithm for the weak (m, `)-TCU, and let T = Tt + To be the total running
time with T = o (FP): we denote with Tt the running time due to tensor units, and with To all the
remaining operations. We can simulate the algorithm on an external memory with M = 3m+O(1)
as follows. All standard operations are simulated using O(1) internal memory and incurring O (To)
I/Os. Each call to the tensor unit can be simulated in the external memory by loading the two
input matrices in the internal memory with O (m) I/Os, computing the product with no I/Os, and
then writing the result in external memory with O (m) I/Os. If we have k calls to the tensor unit,
the algorithm requires km = O (Tt) I/Os (recall that each call requires Θ (m) time). Thus the TCU
algorithm gives an external memory algorithm with I/O complexity O (Tt + To) = o (FP), which is
a contradiction. Therefore, we must have T = Ω (FP).
6 Conclusion
In this paper, we have introduced the first computational model for tensor core units, namely the
(m, `)-TCU model. We have used this model for designing and analyzing several algorithms from
linear algebra, broadening the class of algorithms that benefit of a fast hardware circuit for matrix
multiplication. The paper leaves several open questions:
• The TCU model should be experimentally validated. Do experimental performance behave as
predicted in the theoretical model? Can we use the TCU model for effectively model Google
TPUs and NVIDIA TCs?
• Which other computational problems can benefit of a tensor unit? Can existing algorithms
for deep learning on tensor cores be further improved?
• Hardware accelerators have parallel tensors and low numerical precision. How can we include
these features in the TCU model, and to what extent do they affect TCU algorithm design?
19
Acknowledgments
This work was partially supported by NSF grant CNS-1553510, UniPD SID18 grant, PRIN17
20174LF3T8 AHeAd, MIUR Departments of Excellence, UniBZ-CRC 2019-IN2091 Project, and
INdAM-GNCS Project 2020 NoRMA. Some results are based upon work performed at the AlgoPARC
Workshop on Parallel Algorithms and Data Structures at the University of Hawaii at Manoa, in
part supported by the NSF Grant CCF-1930579.
References
[1] P. Afshani and N. Sitchinava. Sorting and permuting without bank conflicts on gpus. In Proc.
European Symposium on Algorithms (ESA), pages 13–24, 2015.
[2] Apple neural engine. https://www.apple.com/iphone-xr/a12-bionic/.
[3] Arm machine learning processor. https://developer.arm.com/products/processors/
machine-learning/.
[4] G. Ballard, J. Demmel, O. Holtz, and O. Schwartz. Graph expansion and communication costs
of fast matrix multiplication. J. ACM, 59(6):32:1–32:23, 2013.
[5] A. Bjo¨rklund, R. Pagh, V. V. Williams, and U. Zwick. Listing triangles. In Proc. 41st
International Colloquium on Automata, Languages, and Programming (ICALP), pages 223–234,
2014.
[6] C. Bo¨hm, R. Noll, C. Plant, B. Wackersreuther, and A. Zherdin. Data mining using graphics
processing units. Trans. Large-Scale Data- and Knowledge-Centered Systems, 1:63–90, 2009.
[7] R. Carrasco, R. Vega, and C. A. Navarro. Analyzing GPU tensor core potential for fast
reductions, 2019. Arxiv 1903.03640.
[8] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. The
MIT Press, 2001.
[9] A. Dakkak, C. Li, J. Xiong, I. Gelado, and W.-m. Hwu. Accelerating reduction and scan using
tensor core units. In Proc. ACM International Conference on Supercomputing (ICS), pages
46–57, 2019.
[10] J. Dean. Large-scale deep learning with tensorflow for building intelligent systems. ACM
Webinar, 2016.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In B. Leibe,
J. Matas, N. Sebe, and M. Welling, editors, Proc. of Computer Vision(ECCV), pages 630–645,
2016.
[12] R. Jacob and M. Sto¨ckel. Fast output-sensitive matrix multiplication. In Proc. European
Symposium on Algorithms (ESA), pages 766–778, 2015.
[13] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia,
N. Boden, A. Borchers, et al. In-datacenter performance analysis of a tensor processing unit.
In Proc. 44th Annual International Symposium on Computer Architecture (ISCA), pages 1–12,
2017.
20
[14] A. Karatsuba and Y. Ofman. Multiplication of Multidigit Numbers on Automata. Soviet
Physics Doklady, 7:595, 1963.
[15] B. Karsin, V. Weichert, H. Casanova, J. Iacono, and N. Sitchinava. Analysis-driven engineering
of comparison-based sorting algorithms on gpus. In Proc. 32nd International Conference on
Supercomputing (ICS), pages 86–95, 2018.
[16] J. Kleinberg and E. Tardos. Algorithm Design. Addison Wesley, 2006.
[17] F. Le Gall. Powers of tensors and fast matrix multiplication. In Proc. 39th International
Symposium on Symbolic and Algebraic Computation (ISAAC), pages 296–303, 2014.
[18] F. T. Leighton. Introduction to Parallel Algorithms and Architectures: Array, Trees, Hypercubes.
Morgan Kaufmann Publishers Inc., 1992.
[19] S. Markidis, S. W. D. Chien, E. Laure, I. B. Peng, and J. S. Vetter. Nvidia tensor core
programmability, performance precision. In Proc. IEEE International Parallel and Distributed
Processing Symposium Workshops (IPDPSW), pages 522–531, 2018.
[20] M. Nobile, P. Cazzaniga, A. Tangherloni, and D. Besozzi. Graphics processing units in
bioinformatics, computational biology and systems biology. Briefings in Bioinformatics, 18,
2016.
[21] Nvidia Tesla V100 GPU architecture. http://images.nvidia.com/content/
volta-architecture/pdf/volta-architecture-whitepaper.pdf.
[22] M. A. Raihan, N. Goli, and T. M. Aamodt. Modeling deep learning accelerator enabled GPUs.
In Proc. IEEE International Symposium on Performance Analysis of Systems and Software
(ISPASS), pages 79–92, 2019.
[23] R. Raz. On the complexity of matrix product. SIAM Journal on Computing, 32(5):1356–1369,
2003.
[24] B. Reagen, R. Adolf, P. N. Whatmough, G. Wei, and D. M. Brooks. Deep Learning for Computer
Architects. Synthesis Lectures on Computer Architecture. Morgan & Claypool Publishers, 2017.
[25] A. Rodriguez, E. Segal, E. Meiri, E. Fomenko, Y. J. Kim, H. Shen, and B. Ziv. Lower numerical
precision deep learning inference and training. Technical report, Intel, 2018.
[26] R. Seidel. On the all-pairs-shortest-path problem in unweighted undirected graphs. J. Comput.
Syst. Sci., 51(3):400–403, 1995.
[27] S. Shi, Q. Wang, P. Xu, and X. Chu. Benchmarking state-of-the-art deep learning software
tools. In Proc. 7th International Conference on Cloud Computing and Big Data (CCBD), pages
99–104, 2016.
[28] A. Sorna, X. Cheng, E. D’Azevedo, K. Won, and S. Tomov. Optimizing the fast fourier
transform using mixed precision on tensor core hardware. In Proc. IEEE 25th International
Conference on High Performance Computing Workshops (HiPCW), pages 3–7, 2018.
[29] V. Strassen. Gaussian elimination is not optimal. Numer. Math., 13(4), 1969.
[30] J. S. Vitter. Algorithms and data structures for external memory. Foundations and Trends in
Theoretical Computer Science, 2(4):305–474, 2006.
21
[31] V. V. Williams. Multiplying matrices faster than Coppersmith-Winograd. In Proc. 44th
Symposium on Theory of Computing Conference (STOC), pages 887–898, 2012.
[32] Y. Zhu, M. Mattina, and P. Whatmough. Mobile machine learning hardware at arm: A
systems-on-chip (SoC) perspective, 2018. Arxiv 1801.06274.
22
