A Parallel Sparse Tensor Benchmark Suite on CPUs and GPUs by Li, Jiajia et al.
A Parallel Sparse Tensor Benchmark Suite on CPUs
and GPUs
Jiajia Li
jiajia.li@pnnl.gov
Pacic Northwest National
Laboratory
902 Baelle Blvd
Richland, WA, USA 99354
Mahesh Lakshminarasimhan
maheshl@cs.utah.edu
University of Utah
Salt Lake City, UT, USA 84112
Xiaolong Wu
xu1565@purdue.edu
Purdue UniversityElectrical and
Computer Engineering
West Lafayee, USA 47907
Ang Li
ang.li@pnnl.gov
Pacic Northwest National
Laboratory
Richland, WA, USA 99354
Catherine Olschanowsky
catherineolschan@boisestate.edu
Boise State University
Boise, ID, USA 83716
Kevin Barker
Kevin.Barker@pnnl.gov
Pacic Northwest National
Laboratory
Richland, WA, USA 99354
Abstract
Tensor computations present signicant performance chal-
lenges that impact a wide spectrum of applications ranging
from machine learning, healthcare analytics, social network
analysis, data mining to quantum chemistry and signal pro-
cessing. Eorts to improve the performance of tensor com-
putations include exploring data layout, execution schedul-
ing, and parallelism in common tensor kernels. is work
presents a benchmark suite for arbitrary-order sparse ten-
sor kernels using state-of-the-art tensor formats: coordinate
(COO) and hierarchical coordinate (HiCOO) on CPUs and
GPUs. It presents a set of reference tensor kernel imple-
mentations that are compatible with real-world tensors and
power law tensors extended from synthetic graph generation
techniques. We also propose Rooine performance models
for these kernels to provide insights of computer platforms
from sparse tensor view.
1 Introduction
Tensors, multi-dimensional arrays that are oen sparse, are
utilized by a large number of critical applications that span
a range of domain areas. ese include quantum chemistry,
healthcare analytics, social network analysis, data mining,
signal processing, machine learning, and more. Operations
on sparse tensors tend to dominate the execution-time of
these applications. Understanding the performance charac-
teristics of dierent implementation approaches is of para-
mount importance. is paper presents a benchmark suite
specically for that purpose. e suite provides implementa-
tions of common tensor kernels using state-of-the-art sparse
tensor data structures and a variety of real and synthetic
sparse tensors as its input dataset.
Conference’17, Washington, DC, USA
2016. 978-x-xxxx-xxxx-x/YY/MM. . .$15.00
DOI: 10.1145/nnnnnnn.nnnnnnn
Given the heterogeneity in available hardware resources
for high performance computing (HPC), it is non-trivial to
answer questions about the potential for sparse tensor al-
gorithms to be eciently ported to various hardware. e
diculty of planning for the irregular parallelism that results
from operating on sparse data structures is compounded by
the availability of Graphics Processing Units (GPUs), vec-
torizing units, Field Programmable Gate Arrays (FPGAs),
and potentially Tensor Processing Units (TPUs). A set of
important tensor kernels with associated implementations
eases the exploration of this space.
Optimizing the performance of tensor applications is chal-
lenging due to several application characteristics, named
in [13, 32, 36, 40] and briey outlined here for complete-
ness: the curse of dimensionality, mode orientation, tensor
transformation, irregularity, and arbitrary tensor orders (or
dimensions).
Tensors are, by denition, multidimensional. e curse of
dimensionality manifests itself as large computational and
storage overheads required to accommodate the exponen-
tial growth of elements that occurs in some operations. For
instance, a Kronecker product results in exponential expan-
sion of space requirements. Compounding this issue is the
increased interest in applications involving a large number
of dimensions [16, 29, 33, 38, 51]. e data structures sup-
porting sparse tensors and the required tensor operations
are oen mode specic, where each dimension of a tensor is
referred to as a mode. Dierent data structures supporting
sparse tensors favor iterating over specic modes, mode ori-
entation. ere is a tradeo that must be made between space
requirements and enjoying good performance in multiple
representations of various mode sequences. Tensor transfor-
mation is traditionally used to implement tensor operations
by casting them as a set of matrix operations and utilizing
highly tuned linear algebra libraries. However, the transfor-
mation process brings non-trivial overhead to the execution
ar
X
iv
:2
00
1.
00
66
0v
1 
 [c
s.D
C]
  2
 Ja
n 2
02
0
Conference’17, July 2017, Washington, DC, USA Li et al.
of a tensor operation. Mitigating this cost has become at-
tractive for researchers in tensor linear algebra and their
applications [17, 37, 41, 48, 60]. Irregularity in memory ac-
cess paerns and in tensor shape makes poor use of memory
subsystems and complicates code, especially for sparse data.
Optimizations are typically best suited for a specic dimen-
sionality, such as third-order, but most tensor operations are
required to handle arbitrary tensor orders.
Beyond these, challenges associated with all benchmarks
also apply, which include completeness, diversity, extendibility,
reproducibility, and comparability across implementations.
Comparisons across research groups are improved by using
a standard set of kernels and inputs. Using that set as a
starting point, optimizations can be applied and eectively
compared.
Our benchmark suite consists of a set of reference imple-
mentations from various tensor applications, each of which
show dierent computational behavior. Much like
two-dimensional sparse matrices, the data layout, or the
data structure used to hold a sparse tensor, has a signicant
impact on performance and storage [43, 54]. It also has a
signicant impact on how the control ow for a given op-
eration must be executed and its memory footprints. We
implement two sparse tensor formats: the most popular and
mode-generic coordinate (COO) format and a newly pro-
posed, more compressed hierarchical coordinate (HiCOO)
format [42] to represent general, arbitrary sparse tensors.
Beyond the implementation diversity, platform and work-
load (or input) diversity is also critical to gain insights from
a benchmark suite. We implement the same set of tensor
kernels on CPUs and GPUs to provide a good understanding
for users. Dierent inputs of an algorithm usually obtain
dierent performance due to their diverse data sizes and pat-
terns. is phenomenon is more obvious for sparse problems
because their algorithm behavior largely depends on the fea-
tures of data. Besides evaluating limited and hard-to-obtain
real-world tensors, mimicking some application characters
to generate more datasets is valuable for benchmarking. We
create power law tensors extended from synthetic graph gen-
eration techniques. Our benchmark suite can easily adopt
new sparse tensor kernels. We use oating point operations
per second (FLOPS) to compare between kernels and plat-
forms.
e contributions of this work include:
• reference implementations for ve tensor kernels,
Tew, Ts, Ttv, Ttm, and Mttkrp, in COO and HiCOO
formats for CPUs and GPUs; (Sections 2 and 3)
• application of HiCOO to more tensor operations and
an extension of it to more exible variations; (Sec-
tion 3)
• synthetic tensor generation based on Kronecker and
power law generators; (Section 4)
• Rooine performance models for two Intel CPU and
two NVIDIA GPU platforms to analyze the tensor
kernels; and
• insights gained from thorough experiments and anal-
ysis of the performance. (Section 5)
2 Tensor Benchmarks
Tensors are increasingly employed in computations across a
spectrum of application areas. is benchmark suite repre-
sents a set of basic operations chosen by examining a range of
composite operations commonly used by these applications.
e following text provides the denition of each operation,
the motivation for its inclusion, and its applications.
Notationally, we represent tensors as calligraphic capital
leers, e.g., X ∈ RI×J×K ; matrices by boldface capital leers,
e.g., U ∈ RI×J ; vectors by boldface lowercase leers, e.g.,
x ∈ RI ; and scalars by lowercase leers, such as xi jk for the
(i, j,k)-element of a third-order tensor X. A slice is a two-
dimensional cross-section of a tensor, obtained by xing
all indices but two, e.g., S::k = X(:, :,k). A ber is a vector
extracted from a tensor along a certain mode, selected by
xing all indices but one, e.g., f :jk = X(:, j,k).
2.1 Tensor Element-Wise Operations
Tensor element-wise (Tew) operations include addition, sub-
traction, multiplication, and division, that are applied to
every corresponding pair of elements from two tensor ob-
jects if they have the same order and shape (i.e., dimen-
sion sizes). For example, element-wise tensor addition of
X,Y ∈ RI1×···×IN is
Z = X + Y (1)
Similarly for element-wise tensor subtraction Z = X − Y,
multiplication Z = X ◦ Y, and division Z = X Y.
is operation is trivially implemented when the tensors
having exactly the same non-zero paern. However, the
more general case requires iterating over both tensors and
matching elements as the execution proceeds. When X and
Y have dierent paerns, predicting the storage required for
Z is an additional challenge.
2.2 Tensor-Scalar Operations
A Tensor-Scalar (Ts) operation is between the non-zero val-
ues of a tensor and a single scalar. Operations also include
addition (Tsa), subtraction (Tss), multiplication (Tsm), and
division (Tsd). For example, tensor-scalar multiplication of
X ∈ RI1×···×IN is
Y = X × s (2)
is benchmark suite implements only Tsa and Tsm, which
are sucient to suppor them all.
Tew andTs are commonly used in machine learning, quan-
tum chemistry, and more. Tensor convolution operation in
A Parallel Sparse Tensor Benchmark Suite on CPUs and GPUs Conference’17, July 2017, Washington, DC, USA
convolutional neural network (CNN) is a combination ofTew
and Tsm [28]; space mapping in quantum chemistry also in-
volves these two. Tew and Ts are simple tensor operations
and can be implemented along with other tensor operations.
We consider them separately in this benchmark suite because
of their dierent computational behavior (shown in Table 1).
2.3 Tensor-Times-Vector Product
e Tensor-Times-Vector (Ttv) in mode n is the multiplica-
tion of a tensor X ∈ RI1×···×In×···×IN with a vector v ∈ RIn ,
along mode n.
Y = X ×n v
yi1 · · ·in−1in+1 · · ·iN =
In∑
in=1
xi1 · · ·in−1in in+1 · · ·iNvin
(3)
is results in a I1 × · · · × In−1 × In+1 × · · · × IN tensor which
has one less dimension.
Ttv is a critical computational kernel of the tensor power
method [1, 66], an approach for orthogonal tensor decompo-
sition, that decomposes a symmetric tensor into a collection
of orthogonal vectors with corresponding weights. e ten-
sor power method is used in machine learning and signal
processing applications.
2.4 Tensor-Times-Matrix Product
e Tensor-Times-Matrix (Ttm) in mode n, also known as
the n-mode product, is the multiplication of a tensor X ∈
RI1×···×In×···×IN with a matrix U ∈ RIn×R , along mode n, and
is denoted by Y = X×n U. 1 is results in a I1 × · · · × In−1 ×
R × In+1 × · · · × IN tensor, and its operation is dened as
yi1 · · ·in−1r in+1 · · ·iN =
In∑
in=1
xi1 · · ·in−1in in+1 · · ·iNuinr . (4)
Ttm is a special case of tensor contraction, a multiplication
between two tensors in common mode(s). We consider Ttm
specically because: 1) it is more commonly used in ten-
sor decompositions, such as the Tucker decomposition, for
a variety of applications, including (social network, elec-
trical grid) data analytics, numerical simulation, machine
learning, recommendation systems, personalized web search,
etc. [1, 14, 32, 55]; 2) the behavior of tensor contraction
largely depends on which mode(s) to be contracted on; this
creates diculties for benchmarking. Also, note that R is
typically much smaller than In in low-rank decompositions,
and typically R < 100.
1Our convention for the dimensions of U diers from that of Kolda and
Bader’s denition [32]. In particular, we transpose the matrix modes U,
which leads to a more ecient Ttm under the row-major storage convention
of the C language.
2.5 Matriced Tensor-Times-Khatri-Rao Product
Mttkrp, matricized tensor times Khatri-Rao product, is a
matricized tensor times the Khatri-Rao product of matrices.
For an N th-order tensorX and given matricesU(1), . . . ,U(N ),
the mode-n Mttkrp is
U˜(n) = X(n)
(
U(N )  · · ·  U(n+1)  U(n−1)  · · ·  U(1)
)
, (5)
where X(n) is the mode-n matricization of tensor X,  is the
Khatri-Rao product. e Khatri-Rao product is a “match-
ing column-wise” Kronecker product between two matrices.
Given matrices A ∈ RI×R and B ∈ RJ×R , their Khatri-Rao
product is denoted by C = A  B,
C = A  B = [a1 ◦ b1, a2 ◦ b2, . . . , aR ◦ bR ] , (6)
where C ∈ R(I J )×R , ar and br , r = 1, . . . ,R are columns of
A and B, ◦ is the outer product of vectors, a special case
of Kronecker product. However, the Khatri-Rao and Kro-
necker products typically require redundant computation or
extra storage to hold matrix operands, so in practice, these
operations tend to be not implemented directly but rather
integrated into tensor operations.
Mttkrp is the most computational expensive kernel in
CANDECOMP/PARAFAC decomposition (CPD), another pop-
ular tensor decomposition. CPD also has a wide application
in (healthcare, social network, brain signal, electrical grid)
data analytics, machine learning, recommendation systems,
signal processing, personalized web search, quantum chem-
istry, and other domains [1, 14, 32, 55].
Because of the varying computational behavior (shown
in Table 1) of the above operations, we integrate them as a
benchmark suite to help evaluate underlying hardware.
3 Tensor Formats and Kernel
Implementations
Much like sparse matrices, sparse tensors are expressed using
dierent formats. e best choice of format depends on the
sparsity paern of a tensor, operations applied, and the time
required to translate between them. e common default
format for sparse tensors is coordinate (COO) format. New
formats have been developed including compressed sparse
ber (CSF) [57], balanced CSF (BCSF) [25], agged COO (F-
COO) [45], and hierarchical coordinate (HiCOO) [42] for
general sparse tensors, and mode-generic and -specic for-
mats for structured sparse tensors [5]. Our benchmark suite
currently supports COO and HiCOO for general sparse ten-
sors and their variants for semi-sparse tensors with dense
dimension(s).
We chose COO and HiCOO formats because of their mode
generic property, described in [42]. When using a mode
generic format only one tensor representation is needed
for all tensor computations, even in dierent modes. is
is commonly required by tensor methods (e.g., CPD and
Tucker decompositions). COO is the most popular format
Conference’17, July 2017, Washington, DC, USA Li et al.
used in many tensor libraries, e.g., Tensor Toolbox [4], Ten-
sorlab [62], TACO [12, 31], and ParTI [39]. HiCOO, a newly
proposed format, obtains good compression and state-of-the-
art tuned performance [42]. Other formats especially CSF
will be considered for our benchmark suite in the near fu-
ture. is section overviews our supported formats and their
corresponding parallel implementations for tensor kernels.
We keep the implementations simple yet eective; the
benchmark represents a general case where the primary
computation is not obfuscated by optimization aempts. We
use more preprocessing to trade for less kernel computa-
tion, which keeps the benchmark more ecient compared
to the pillar libraries, e.g., Tensor Toolbox [4] and Tensor-
Lab [62]. Besides, our implementations directly operate on
sparse tensor elements to avoid the tensor-matrix transfor-
mations. is suite provides a performance baseline and a
starting point for new optimization strategies. It is easy to
adopt new implementations, operations, and formats from
users and to be adapted in a communication scheme.
3.1 Coordinate Format (COO)
Coordinate format is commonly used to store sparse matrices
and tensors. It does not require or guarantee any specic
ordering of the data. e data values are stored in a one-
dimensional array, no maer how many dimensions are
represented in the data. For each dimension an additional
index array is added that indicates the position of the value in
that dimension. Figure 1(a) gives an example that a general
third-order sparse tensor requires three index arrays. e
storage space of an N th-order COO tensor X ∈ RI1×···×IN
with M non-zeros is 4(N + 1)M bytes consisting of 32-bit
indices and single-precision oating-point values.
We also describe a variant of COO format (semi-sparse
COO, sCOO) for a semi-sparse sparse tensor with dense
modes [5, 41], which will be used in Ttm. A dense mode
means the bers on it are all dense. sCOO stores the dense
mode(s) as dense array(s) and the rest modes are kept the
same as in COO format, as shown by another example tensor
in Figure 1(b), where the mode k is dense.
Figure 1. COO format for a general sparse tensor and sCOO
format [41] for a semi-sparse tensor.
3.2 Parallel Implementations based on COO
Table 1 presents the operational intensity of each kernel
using a cubical third-order tensor, while all the implementa-
tions in the benchmark suite support arbitrary tensor orders.
Operational intensity (OI) is the ratio of bytes required per
oating point operation for a given computation. For all
operations except Mttkrp, we have a pre-processing stage
to allocate output tensor space along with their indices.
e implementations of Tew and Ts follow directly from
their Equations, (1) and (2), that is one loop over all non-zero
values to do the corresponding computation. In the pre-
precessing stage, we allocate and set indices for the output
tensors due to their easy-to-predict non-zero paern. 2 Tew
and Ts have the smallest operational intensity: 1/12 and 1/8.
We will use Ttv algorithm as a representative to explain the
similar Ttm algorithms as well; Ttm algorithms can be found
in [41, 47], and Mttkrp algorithms can be easily found in
literatures and soware [4, 39, 42, 62].
Table 1. e analysis of kernel algorithms for third-order
cubical tensors (X ∈ RI×I×I ). We consider all input tensors
with M non-zero entries and MF bers, I  MF  M . e
indices use 32 bits, and values are single-precision oating-
point numbers with 32 bits as well.
Kernels Work Memory Access (#Bytes) OI(#Flops) COO HiCOO
Tew M 12M 12M 1/12
Ts M 8M 8M 1/8
Ttv 2M 12M + 12MF 12M + 12MF ∼ 1/6
Ttm 2MR 4MR + 4MFR 4MR + 4MFR ∼ 1/2
+8M + 8MF +8M + 8MF
Mttkrp 3MR 12MR + 16M 12Rmin {nbMB, M } ∼ 1/4
+7M + 20nb
∗ We consider only one-level Cache with the minimum cache size to satisfy
the data reuse in algorithms.
3.2.1 Multicore CPU
We use a sparse-dense property for a sparse tensor times
a dense vector/matrix (Ttv and Ttm), introduced by Li et
al. [41]. at is, if the computation is between a sparse mode
of a tensor and a dense vector from the vector itself or a
matrix, this mode will become dense in the output; and the
other modes keep the same non-zero distribution (or sparsity)
with the original modes of the input tensor. is property
makes pre-allocating space for the outputs of Ttv and Ttm
possible, with the help of the sCOO format for semi-sparse
tensors. Introducing this property is good for paralleliza-
tion by avoiding output data races and dynamic memory
allocation, especially useful for GPU implementations.
2For Tew with two input tensors have dierent non-zero paerns, we
support it in the benchmark suite but not analyze them for performance
perspective.
A Parallel Sparse Tensor Benchmark Suite on CPUs and GPUs Conference’17, July 2017, Washington, DC, USA
Algorithm 1 COO-Ttv-OMP algorithm.
Input: A third-order sparse tensorX ∈ RI×J×K with M non-zeros in COO
format, dense vector V ∈ RK , and an integer n (= 3);
Output: Sparse tensor Y ∈ RI×J in COO format;
. Y = X ×n v
1: Pre-process to obtain MF, the number of mode-n bers of X and fptr,
the beginning of each X’s mode-n ber, size MF.
2: Pre-allocate Y space with MF non-zeros and their indices;
3: parfor f = 1, . . . , MF do
4: fX = fptr(f )
5: inds1Y (f ) = inds1X (fX )
6: inds2Y (f ) = inds2X (fX )
7: v = valY (f )
8: form = fX , . . . , fptr(f + 1) − 1 do
9: k = inds3X (m)
10: v+ = valX (m) × u(k )
11:
12: Return Y;
COO-Ttv-OMP is illustrated in Algorithm 1, rstly pro-
posed in this work. We rst pre-process the input tensor X
to record the locations of mode-n bers. According to the
sparse-dense property, the output Y is pre-allocated with
MF non-zeros and its indices i, j the same as in X. e stor-
age is consists of 16M for X, 4I for v, and 12MF for Y 3.
e number of oating-point operations (#Flops) is 2M . e
memory access in Table 1 counts 4M bytes for v because of
its irregular and unpredictable memory access introduced
by index-k . Data reuse of v could happen if its access has or
gains a good localized paern naturally or from reordering
techniques [44, 57], similarly for the matrices in Ttm and
Mttkrp. us, their performance could be improved due to
reductions in memory pressure. e operational intensity is
approximately 1/6 by ignoring less signicant terms. COO-
Ttm-OMP is similar to COO-Ttv-OMP with the output as
a semi-sparse tensor stored in sCOO format. ey are both
parallelized for independent bers, but work imbalance may
exist because of dierent ber lengths of sparse tensor X.
COO-Mttkrp is widely used in Tensor Toolbox [4], Ten-
sorlab [62]; and COO-Mttkrp-OMP is implemented in
ParTI library [39]. Each row of A˜ is updated by scaling the
dot product of two rows of matrices B and C by the non-zero
value of X. Its operational intensity is approximately 1/4
again by ignoring less expensive terms. COO-Mttkrp-OMP
is parallelized by non-zeros, but with atomic operations to
protect output matrix. e data race may inuence its per-
formance dierently depending on non-zero distributions of
an input tensor.
3.2.2 GPU
COO-Tew-GPU,-Ts-GPU, and -Ttv-GPU use one-dimensional
CUDA grids of one-dimensional thread blocks to parallelize
non-zeros and bers respectively. For example, M non-zeros
3which is actually a matrix for a third-order Ttv.
are assigned to M/256 thread blocks with 256 threads for
each. Again, due to the potential unbalanced ber lengths,
COO-Ttv-GPU could suer more performance drop. While
COO-Ttm-GPU and -Mttkrp-GPU use one-dimensional
grids of two-dimensional thread blocks to parallelize the
dense matrices, both of them are implemented in ParTI li-
brary [39]. In COO-Ttm-GPU algorithm, the x-dimension of
thread blocks are used to represent matrix columns for GPU
memory coalescing, while y-dimension represents non-zeros.
(Refer to details in [47].) Be aware that the load imbalance
still exists for COO-Ttv and COO-Ttm, and also data race
for COO-Mttkrp.
3.3 Hierarchical Coordinate (HiCOO)
Hierarchical Coordinate (HiCOO) [42] is derived from COO
format but further compresses tensor indices in units of
sparse blocks with a pre-specied block size B. It represents
tensor indices using two-level block and element indices,
with element indices stored in only 8-bit. An extra block
pointer array bptr is needed to store the starting locations
of every ber. Figure 2(a) shows HiCOO representation for
the same tensor in Figure 1(a) in 2 × 2 × 2 blocks. e same
with COO format, HiCOO does not assume any mode order,
and only one representation of a sparse tensor is enough for
all its computations. While HiCOO saves the tensor storage
from two aspects: 1) shorter bit-length for element indices;
2) shortened array length for block indices. Readers can refer
to more details in the paper [42].
In this work we introduce two variants based on HiCOO
format: gHiCOO and sHiCOO. gHiCOO is a generalization
of HiCOO format for a general sparse tensor (Figure 2(b))
and sHiCOO is for semi-sparse tensors with dense mode(s)
(Figure 2(c)). Concluded by the prior work [42], HiCOO could
not be benecial for hyper-sparse tensors where most tensor
blocks only consist of one or few non-zeros. To conquer
this problem, we propose gHiCOO where we can pick which
modes to be compressed in units of blocks for HiCOO and
which stay in COO format. Figure 2(b) chooses to compress
modes i and j, leaving mode k in the same index array with
Figure 1(a). gHiCOO also provide convenience to implement
tensor operations where not all modes are needed during
computation, such as Ttv and Ttm. sHiCOO is similar to
sCOO but in HiCOO format. Figure 2(c) uses sHiCOO to
compress the same semi-sparse tensor in Figure 1(b) with
dense mode k. Our format variants could be useful in tensor
methods and benchmarking for more ecient space and
computation.
3.4 Parallel Implementations based on HiCOO
HiCOO parallel implementations are all rstly proposed here
except Mttkrp on CPUs [42]. Advanced techniques such
as privatization, lock-avoiding parallel strategies, advanced
scheduling [42] are not adopted as the purpose of this suite
Conference’17, July 2017, Washington, DC, USA Li et al.
Figure 2. HiCOO, gHiCOO formats for general sparse ten-
sors and sHiCOO for semi-sparse tensors.
is to act as a reference implementation for the community
and also to avoid complicated tuning parameters.
3.4.1 Multicore CPU
Since the pre-processing step deals with allocating space and
seing indices for the output tensor in HiCOO rather than
COO format, the value computation of HiCOO-Tew-OMP
and HiCOO-Ts-OMP is the same with COO-Tew-OMP and
COO-Ts-OMP respectively.
HiCOO-Ttv-OMP andHiCOO-Ttm-OMP also pre-allocate
the output tensors according to the sparse-dense property.
We use gHiCOO format to represent the input tensor X by
leaving the mode doing the product uncompressed. ere-
fore, Ttv and Ttm can bypass the blocking nature of HiCOO
and be performed without data race between blocks. en
the same computation will be implemented for HiCOO-Ttv-
OMP and HiCOO-Ttm-OMP as in their COO counterparts,
but pre-allocating the output tensors and their indices in
HiCOO or sHiCOO format respectively.
Algorithm 2 HiCOO-Mttkrp-OMP algorithm in mode-
1 [42].
Input: A sparse tensor X ∈ RI×J×K with M non-zeros in HiCOO format,
dense matrices B ∈ RJ×R, C ∈ RK×R , block size B;
Output: Updated dense matrix A˜ ∈ RI×R ;
. A˜← X(1)(C  B)
1: parfor b = 1, . . . , nb do
2: bi = binds1(b), bj = binds2(b), bk = binds3(b);
3: Ab = A + bi · B · R; Bb = B + bj · B · R; Cb = C + bk · B · R;
4: for x = bptr (b), . . . , bptr (b + 1) − 1 do . entry x
5: ei = einds1(x ), e j = einds2(x ), ek = einds3(x )
6: value = val(x )
7: for r = 1, . . . , R do
8: A˜b (ei, r )+ = value ·Cb (ek, r ) · Bb (e j, r )
9:
10: Return A˜;
HiCOO-Mttkrp-OMP is more complicated because in-
dices of all tensor modes are used in this computation, dif-
ferent from Ttv and Ttm where the indices from only one
mode are needed. We rst block matrices A, B, and C as Ab ,
Bb , and Cb to be reused for the non-zeros inside a tensor
block. For each block, we update the values of a block of out-
put matrix A˜ using corresponding blocks Bb , and Cb . In this
way, we do not need to compute the actual indices i, j,k out,
and data locality is increased due to blocking and Morton or-
der sorting when constructing a HiCOO representation [42].
Dierent from COO-Mttkrp-OMP, HiCOO-Mttkrp-OMP
parallelize tensor blocks rather than all non-zeros.
e analysis of HiCOO algorithms are also listed in Table 1.
Since HiCOO-Tew, -Ts, -Ttv, -Ttm all have the same value
computation step with COO counterparts, so they have the
same behavior except for the storage space, where HiCOO
is usually benecial. However, from our experiments, we
still observe some benets of HiCOO aected by the pre-
processing stage. However, HiCOO-Mttkrp has smaller
memory access than COO-Mttkrp due to its blocked feature.
3.4.2 GPU
HiCOO-GPU implementations are also the same with COO
ones except HiCOO-Mttkrp-GPU. is unoptimized HiCOO-
Mttkrp-GPU maps one tensor block to a CUDA thread block,
thus the balanced workload by non-zero distribution for
COO-Mttkrp disappears, while the atomic operation stays.
erefore, the work imbalance due to dierent numbers of
non-zeros in tensor blocks could be make its performance
even worse than COO-Mttkrp-GPU.
4 Tensor Dataset
Table 2. Description of sparse tensors.
No. Tensors Order Dimensions #Nnzs Density
r1 vast 3 165K × 11K × 2 26M 6.9 × 10−3
r2 nell2 3 12K × 9K × 29K 77M 2.4 × 10−5
r3 choa 3 712K × 10K × 767 27M 5.0 × 10−6
r4 darpa 3 22K × 22K × 24M 28M 2.4 × 10−9
r5 fb-m 3 23M × 23M × 166 100M 1.1 × 10−9
r6 fb-s 3 39M × 39M × 532 140M 1.7 × 10−10
r7 flickr 3 320K × 28M × 1.6M 113M 7.8 × 10−12
r8 deli 3 533K × 17M × 2.5M 140M 6.1 × 10−12
r9 nell1 3 2.9M × 2.1M × 25M 144M 9.1 × 10−13
r10 crime4d 4 6K × 24 × 77 × 32 5M 1.5 × 10−2
r11 uber4d 4 183 × 24 × 1140 × 1717 3M 3.9 × 10−4
r12 nips4d 4 2K × 3K × 14K × 17 3M 1.8 × 10−6
r13 enron4d 4 6K × 6K × 244K × 1K 54M 5.5 × 10−9
r14 flickr4d 4 320K × 28M × 1.6M × 731 113M 1.1 × 10−14
r15 deli4d 4 533K × 17M × 2.5M × 1K 140M 4.3 × 10−15
is benchmark suite uses sparse tensors derived from
real-world applications from online collections [26, 53, 56].
It also generates synthetic sparse tensors based on graph
models that preserve the properties of real-world graphs. 4
e benchmark suite can be run against any set of tensors
provided that they are expressed using coordinate format.
4.1 Tensors From Real World Applications
e tensors taken from real-world applications are described
in Table 2 sorted by tensor order and decreasing density.
ey are taken from e Formidable Repository of Open
4Our benchmark suite includes synthetic tensor generators, and links to
the tensors already included in existed collections [26, 53, 56].
A Parallel Sparse Tensor Benchmark Suite on CPUs and GPUs Conference’17, July 2017, Washington, DC, USA
Sparse Tensors and Tools (FROSTT) dataset [56], the HaTen2 [27]
dataset, and one built from electronic medical records from
Children’s Healthcare of Atlanta [53]. ese tensors were
chosen to represent a wide range of domains: paern recog-
nition (vast, nips4d), natural language processing (nell2, nell1),
healthcare analytics (choa), recommendation systems (deli,
deli4d, flickr4d), crime detection (crime4d), anomaly detection
(enron4d).
4.2 Synthetic Tensor Generation
e Kronecker graph model [34] and the biased power law
generator from the FireHose streaming benchmark [2] were
extended to generate 3- and 4-dimensional tensors. ese
methods were selected because the resultant graphs follow
the power law distribution, exhibit a small diameter, and
have a high average clustering coecient.
e soware to generate synthetic tensors is included in
the benchmark suite and the synthetic tensors used in our
experiments are described in Table 3 in a period order of
”small, medium, large”. e regular 3D and 4D tensors, which
are equidimensional along each mode, are generated using
the Kronecker generator. e irregular 3D and 4D tensors are
generated using the biased power law streaming generator.
e 3D irregular tensors have one mode completely dense
and much smaller compared to the two other modes which
are equidimensional and sparse. Two modes of the irregular
4D tensors are hypersparse and equidimensional, and the
other two dimensions are entirely dense and smaller.
4.2.1 e Stochastic Kronecker Graph Model
e Stochastic Kronecker graph model [34] is a fractal growth
model that preserves the previously listed properties of real-
world graphs. We extended this approach to generate sparse
tensors of order N by accepting the initiator as a tensor τ1
with N modes. Similar to the Stochastic Kronecker graph
model, by taking the repeated Kronecker product of τ1 for
n times, a larger N th-order tensor τn is produced. With
Bernoulli sampling, τn can be realized as a large sparse ten-
sor representing the resultant hyper-graph that follow the
properties of real-world networks.
e exponential growth of Kronecker multiplication limits
the sizes of N th-order tensors that can be generated. We
overcome this by performing an additional iteration of Kro-
necker multiplication and strip o the tensor coordinates
that fall outside the given size along each mode.
4.2.2 e Power Law Generator
e power law generated graphs do not possess a high aver-
age clustering coecient, there is no restriction on the size
of the graph generated. e generator produces a stream
of edges that when combined form a graph respecting the
power law distribution. is is used to create tensors by
combining together the sparse graphs to form slices of a
third order tensor which is hypergraph. is process, when
Table 3. Description of synthetic tensors generated (Gen.)
with Kronecker and Power law generators indicated as Kron.
and PL respectively.
No. Tensors Gen. Order Dimensions #Nnzs Density
s1 regS Kron. 3 (65K )3 1.1M 3.72 × 10−9
s2 regM Kron. 3 (1.1M )3 11.5M 9.97 × 10−12
s3 regL Kron. 3 (8.3M )3 94M 1.61 × 10−13
s4 irrS PL 3 (32K )2 × 76 1M 1.26 × 10−5
s5 irrM PL 3 (524K )2 × 126 10M 1.43 × 10−6
s6 irrL PL 3 (4.2M )2 × 168 84M 2.86 × 10−8
s7 regS4d Kron. 4 (8.2K )4 1M 2.26 × 10−10
s8 regM4d Kron. 4 (2.1M )4 11.2M 5.83 × 10−19
s9 regL4d Kron. 4 (8.3M )4 110M 2.23 × 10−20
s10 irrS4d PL 4 (1.6M )3 × 82 1.0M 2.90 × 10−9
s11 irrM4d PL 4 (2.6M )3 × 144 10.8M 4.17 × 10−12
s12 irrL4d PL 4 (4.2M )3 × 226 100M 6.0 × 10−15
s13 irr2S4d PL 4 (1.0M )2 × 122 × 436 1.6M 2.81 × 10−11
s14 irr2M4d PL 4 (4.2M )2 × 232 × 746 19.9M 6.53 × 10−12
s15 irr2L4d PL 4 (8.3M )2 × 952 × 324 109M 5.1 × 10−12
repeated on 3rd order tensors can generate a sparse tensor
with N modes.
5 Experimental Results
We test our tensor kernels on four parallel platforms in-
cluding Intel CPUs and NVIDIA GPUs and build Rooine
performance models to measure our performance bounds
for detailed analysis.
5.1 Congurations
5.1.1 Platforms
We run the experiments on four parallel platforms, the pa-
rameters of which are listed in Table 4. All Intel platforms
are non-uniform memory access (NUMA) machines with
2-4 NUMA nodes. We calculate the peak single-precision
(SP) oating point performance and main/global-memory
bandwidth from these parameters. e peak SP performance
of all machines is above 1 TFLOPS. GPUs show advantages
in peak performance and memory bandwidth over CPUs by
approximately 4 − 12× and 3 − 7× respectively.
Table 4. Platform parameters.
Parameters Intel CPUs NVIDIA GPUs
Bluesky Wingtip DGX-1P DGX-1V
Processor Intel Xeon Intel Xeon NVIDIA NVIDIAGold 6126 E7-4850v3 Tesla P100 Tesla V100
Microarch Skylake Haswell Pascal Volta
Frequency 2.60 GHz 2.20 GHz 1.48 GHz 1.53 GHz
#Cores 24 (12 × 2) 56 (14 × 4) 3584 5120
Peak SP 1.0 2.0 10.6 14.9
Perf. TFLOPS TFLOPS TFLOPS TFLOPS
LLC size 19 MB 35 MB 3 MB 6 MB
Mem. size 196 GB 2114 GB 16 GB 16 GB
Mem. type DDR4 DDR4 HBM2 HBM2
Mem. freq. 2.666 GHz 2.133 GHz 0.715 GHz 0.877 GHz
Mem. BW 256 GB/s 273 GB/s 732 GB/s 900 GB/s
Compiler gcc 7.1.0 gcc 5.5.0 CUDA Tkit 9.1 CUDA Tkit 9.0
Conference’17, July 2017, Washington, DC, USA Li et al.
5.1.2 Kernel Implementation Details
Since our data is naturally sorted in a particular mode order,
COO implementations could take some advantages from the
beer data locality. e addition operation is the representa-
tive of Tew and the multiplication operation represents Ts;
the performance using dierent operations are quite similar
in our experiments. For multicore CPU implementations,
we use OpenMP for parallelization with dierent scheduling
strategies, with the number of threads is set to the number of
physical cores. “omp atomic” is used to deal with data race
in Mttkrp, and “omp simd” is for vectorization of Ttm and
Mttkrp. We use “numactl –interleave=all –physcpubind” to
interleave memory allocation for beer memory bandwidth
usage and thread binding for lower scheduling overhead. For
GPU implementations, “atomicAdd” is used in Mttkrp. For
HiCOO format, we x the block size to 128 to t into the
last-level cache in all platforms and use only 8 bits to store
element indices. We use 16 as the column size for matrices
in Ttm and Mttkrp, to reect the low-rank feature in pop-
ular tensor methods. We run all kernels ve times to get
the average; the time of Ttv, Ttm, and Mttkrp is further
averaged among all tensor modes.
(a) Bluesky (b) Wingtip
(c) DGX-1P (d) DGX-1V
Figure 3. Rooine models marked with the operational
intensities of tensor kernels.
5.2 Rooine Performance Models
e Rooine performance model [63, 64] is a graphical rep-
resentation of machine characteristics. It is employed for
performance analysis in various application domains: digital
signal processing, e.g., Spiral [52], sparse/dense linear alge-
bra [64, 67], and Laice Boltzmann Magnetohydrodynam-
ics (LBMHD) [64]. e Empirical Rooine Tool (ERT) [46],
included into Intel Advisor tool, automates measuring the
target machine characteristics. ERT automatically generates
Rooine data including the maximum bandwidth for vari-
ous levels of the memory hierarchy which are obtained by
testing a variety of micro-kernels 5. ERT can utilize MPI,
OpenMP, and CUDA for parallelization; we congure it to
the corresponding compiler in Table 4 for tests.
5similar to STREAM benchmark suite [49]
Figure 3 plots the Rooine models for the four platforms
in Table 4 with DRAM and last-level cache (LLC) bandwidth
tested from ERT, and the theoretical peak SP performance
and DRAM bandwidth (not cache-aware) for reference. We
mark the operational intensities (#Flops/#Bytes) of our ten-
sor kernels calculated from Table 1 overlying with Rooine
models. “ERT-DRAM” bandwidth is the obtainable band-
width from benchmarking micro-kernels, thus OIs of all the
tensor kernels are marked on this line. From this gure, all
of the sparse tensor kernels we consider are main or global
memory bound for CPUs and GPUs respectively. Higher
bandwidth will accelerate kernel execution, while other fac-
tors such as beer data reuse (cache utilization) lowering
memory access pressure will also lead to performance im-
provement. We use the computed obtainable performance
of all tensor kernels as the upper bounds in our performance
gures below (called “Rooine performance”), calculated by
timing an OI value with the “ERT-DRAM” bandwidth. e
OI value is an accurate #Flops/#Bytes ratio by taking dier-
ent tensor features into account, especially for Ttv and Ttm
because of the MF term in Table 1.
5.3 Performance
is section presents the performance of all the ve tensor
kernels on two datasets, real and synthetic, for four plat-
forms. e performance in GFLOPS of each tensor kernel is
calculated from #Flops (in Table 1) divided by the measured
execution time. X-axis represents tensors using the numbers
in Tables 2 and 3 from dierent datasets.
Observation 1: Achieved performance is diverse and hard
to predict, which varies with the dimension sizes and non-zero
paerns of tensors, platforms, and data formats.
From Figures 4 to 7, the actual performance in GFLOPS are
extremely diverse between tensor kernels, data formats, plat-
forms, and datasets. Take synthetic dataset as an example,
the achieved performance varies from 0.8 GFLOPS (regS in
COO) to 81 GLOPS (irr2S4d in HiCOO) in Figure 4 on Bluesky
platform. Besides, the average performance of the ve ker-
nels ranges a lot as well. Tew, Ts, Ttv, Ttm, Mttkrp kernels
achieve an average of 14.6, 35.1, 6.3, 37.7, and 2.7 GFLOPS
respectively for COO format, and 22.3, 40.8, 14.4, 35.8, and
2.6 GFLOPS respectively for HiCOO format. is also shows
HiCOO on average behaves beer than COO format for Tew,
Ts, and Ttv and gets similar performance on Ttm and Mt-
tkrp. Even for the performance eciency (or bandwidth
eciency), these kernels still vary a large range from the
lowest 2% (Mttkrp on irr2S4d) to 353% (Ts on regS) for COO
format and 2% (Mttkrp on irr2S4d) - 479% (Ts on regS). (e
above 100% eciency phenomenon will be explained below.)
Also, quite dierent performance numbers are observed be-
tween real and synthetic datasets under the same tensor
kernel. For Wingtip platform in Figure 5, Ttv shows much
lower GFLOPS numbers than those on Bluesky. ough we
observe some trend especially for synthetic dataset and Ttm
A Parallel Sparse Tensor Benchmark Suite on CPUs and GPUs Conference’17, July 2017, Washington, DC, USA
(a) Real tensors
(b) Synthetic tensors
Figure 4. Single-precision performance of tensor kernels on the Bluesky platform with the Rooine performance.
(a) Real tensors
(b) Synthetic tensors
Figure 5. Single-precision performance of tensor kernels on the Wingtip platform with the Rooine performance.
(a) Real tensors
(b) Synthetic tensors
Figure 6. Single-precision performance of tensor kernels on NVIDIA DGX-1P with the Rooine performance.
(a) Real tensors
(b) Synthetic tensors
Figure 7. Single-precision performance of tensor kernels on NVIDIA DGX-1V with the Rooine performance.
Conference’17, July 2017, Washington, DC, USA Li et al.
operation, generally it is hard to predict the performance
of a sparse kernel, even operating with a dense matrix or
vector.
Observation 2: Performance is generally below the Rooine
performance calculated from main/global memory bandwidth
except for some small tensors ing into caches or algorithms
with good data locality thus making a good use of caches.
Most of cases in Figures 4 to 7 fall below the red Rooine
performance line calculated from main/global memory band-
width from Figure 3 except for some case of Tew and Ts on
Bluesky and Wingtip CPU platforms and Mttkrp on DGX-
1V GPU platform. Take Figure 4 as an example again, the ten-
sors exceed Rooine performance are all small tensors with
around 1M non-zeros: regS, irrS, regS4d, irrS4d, and irr2S4d for
Tew, all small and medium synthetic tensors and small real
tensors with 3-5M non-zeros: crime4d, uber4d, and nips4d. e
last level cache size of Bluesky is 19 MB which could reside
three tensor values each with around 1.6M non-zeros of Tew
and two tensor values each with around 2.4M non-zeros of
Ts. ese numbers matches with the small synthetic and
real tensors, while the medium tensors also gain some cache
benet. For Mttkrp on DGX-1V, COO-Mttkrp-GPU gets
higher performance than Rooine more on irregular-shaped
tensors in synthetic dataset. And the tensors achieve high
performance on DGX-1P are easier to get break the upper
bound on DGX-1V. One reason is that V100 GPU architec-
ture has a twice larger LLC (6M) than P100; besides, V100
get improved atomic operation performance which could
benet Mttkrp; another reason is that they may have very
good data reuse or small working-set size, e.g., tensors with
a very short mode, so it is cache that oers the data injec-
tion rate rather than o-chip memory; lastly, the integer and
oating-point operations have independent data-paths for
instruction issuing on Volta architecture. erefore, address
computation which is extensively used in Mttkrp can be
overlapped with oating-point operations, which may miti-
gate the waiting time for address calculation compared with
earlier GPU architectures.
Observation 3: It is hard to obtain good performance ef-
ciency for non-streaming kernels on multi-socket CPU ma-
chines because of NUMA eect, which could be even harder
than on GPUs.
On Bluesky (Figure 4), the average performance eciency
of Ttv, Ttm, and Mttkrp is 31%, 64%, 6% for COO format,
and 73%, 61%, 5% for HiCOO format; while the numbers
are 9%, 52%, 9% for COO and 13%, 47%, 9% for HiCOO on
Wingtip (Figure 5). ough Mttkrp behaves a lile higher
eciency on Wingtip, its eciency is still very low. e
increment could come from beer parallelism of Wingtip
with 56 cores. On DGX-1P GPU (Figure 6), the average
performance eciency of Ttv, Ttm, and Mttkrp is 30%,
60%, 40% for COO format, and 30%, 60%, 28% for HiCOO
format; while the numbers are 30%, 69%, 110% for COO
and 30%, 69%, 57% for HiCOO on DGX-1V (Figure 7). e
eciency numbers on four-socket Wingtip CPU are all lower
than those on DGX-1P and DGX-1V GPUs, while Ttv and
Ttm achieves beer or similar eciency than the two GPUs
on two-socket Bluesky CPU.
Observation 4: HiCOO algorithms is faster than or similar
to COO counterparts because of its beer local locality and
smaller memory footprint, exceptMttkrp on GPUs where load
imbalance and lower parallelism play more important roles.
From the average eciency and performance numbers
shown in Observation 1 and 3, HiCOO on average behaves
beer than COO format for Tew, Ts, and Ttv and gets sim-
ilar performance on Ttm and Mttkrp on CPU platforms.
On the two GPUs, due to their smaller last-level cache size,
HiCOO does not benet as much as on CPUs. From Figures 6
and 7, HiCOO gets very similar performance on Tew, Ts,
Ttv, and Ttm because of their similar execution code for
tensor value computation. While HiCOO-Mttkrp behaves
worse than COO-Mttkrp because of their dierent parallel
strategies. HiCOO parallelize tensor blocks with severe load
imbalance issue and lower parallelism compared to COO-
Mttkrp algorithm. us, to beer use HiCOO format, a
careful tuning need to be done according to architecture
features.
Observation 5: Dierent datasets expose very dierent
performance behavior, which shows the importance of synthetic
datasets to performance benchmarking and analysis.
Compare the performance trend of real and synthetic
datasets. Tew and Ts show obvious period trend from high
to low or low to high on CPUs and GPUs respectively due to
dierent cache sizes on synthetic dataset, while it is hard to
nd trends in real dataset. Ttv and especially Ttm show a
matching trend with the Rooine predicted performance for
both real and synthetic tensors. Since real tensors are from
diverse real application scenarios, it is hard to do benchmark-
ing and performance analysis solely based on them. Besides,
real tensors are limited due to data privacy and other pub-
licity issues. Extracting features from real tensors as a basis
to create more complete synthetic tensors would be very
helpful for sparse tensor research.
Overall, diverse performance behavior is observed among
the ve kernels and between the COO and HiCOO formats on
the Intel CPU and NVIDIA GPU platforms. Our benchmark
performance is still lower than the theoretical Rooine for
most cases, especially for Mttkrp. Advanced performance
optimization could be further explored, such as the lock-
avoiding techniques employed in [42].
6 Related Work
Work related to this benchmark suite includes various bench-
marking collections and synthetic benchmark data genera-
tion. is benchmark suite is distinct from previous eorts in
its focus on multi-dimensional tensors combined with both
real and synthetic input.
A Parallel Sparse Tensor Benchmark Suite on CPUs and GPUs Conference’17, July 2017, Washington, DC, USA
Benchmark suites measure machine aributes and exem-
plify computing paerns. Benchmarks that measure spe-
cic machine aributes include LINPACK [20], SPEC [18],
STREAM [49], GeekBench [23], Multimaps [59], Bandwidth [58],
and others. Most of them aim to measure memory band-
width achieved under varying conditions and a few target
architecture oating-point capabilities. Benchmark suites
are oen organized around the concept of application exem-
plars. ese suites emulate common paerns and behaviors
in application classes of interest. Several examples of these
suites have been published: LAPACK/ScaLAPACK for dense
linear algebra [8], Colella’s Seven Motif’s [15] for scientic
computing, PARSEC [7] and SPLASH2 [65], Rodinia [11],
Graph500 [50], SparseBench [19], GAP [6], SSCA#2 [30],
and Tartan [35] are just a few.
Several approaches to synthetic graph generation have
been proposed. Our work extends two of these, power law
graphs from Firehose, and Kronecker graphs from Graph500 [50].
FireHose is a suite of stream processing benchmarks [2],
one of a front-end generator of which is the biased power
law generator. Existing synthetic tensor generators like
SimTensor [22], Nway Toolbox [3], and the Tensor Tool-
box [4] are specic to tensors with Tucker [61], CANDE-
COMP/PARAFAC decomposition [9, 10] structures or partic-
ular data distributions. is paper provides a starting point
to generate sparse tensors that preserve the properties of
real-world or multi-aributed graphs that can be realized as
higher-order sparse tensors.
Many libraries support sparse tensor methods, such as Ten-
sor Toolbox [4], Nway Toolbox [3], Tensorlab [62], TACO [31],
SPLATT [57], and ParTI [39]. As a benchmark suite, we sup-
ply widely-adopted reference implementations and make
continuously eort to include state-of-the-art algorithms
and data structures as well.
7 Conclusion
is paper presents a benchmark suite targeting sparse ten-
sor kernels. Operations on sparse tensors are common in
a wide range of important applications. e operations are
memory bound and oen dominate application performance.
is benchmark suite identies important kernels and data
representations and provides reference implementations to
aid the community in eectively sharing and comparing
performance and optimization results. Two methods for
synthetic tensor generation are provided by preserving the
properties of real-world graphs. A subset of possible syn-
thetic tensors are used in this paper. e tool provides the
ability to generate custom synthetic tensors in a reproducible
manner.
Five observations are made based on performance anal-
ysis over Rooine models to gain insights of sparse tensor
behavior across architectures. is benchmark suite is a con-
tinuous eort: additional operations, such as TTM-chain in
Tucker decomposition, tensor contraction, a sparse tensor
with a sparse vector/matrix operations; more complete ten-
sor methods, such as CANDECOMP/PARAFAC and Tucker
decompositions; data representations, such as compressed
sparse ber (CSF) [57], balanced CSF (BCSF) [25]; more plat-
forms, such as distributed systems, multiple GPUs, and other
new architectures (e.g., FPGAs and Emu [21, 24]) will be
included to the suite in the future.
References
[1] Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and
Matus Telgarsky. 2014. Tensor Decompositions for Learning Latent
Variable Models. J. Mach. Learn. Res. 15, 1 (Jan. 2014), 2773–2832.
[2] Karl Anderson and Steve Plimpton. 2015. FireHose Streaming Bench-
marks. Technical Report. Sandia National Laboratory.
[3] Claus A Andersson and Rasmus Bro. 2000. e N-way Toolbox for
MATLAB. Chemometrics and Intelligent Laboratory Systems 52, 1 (Aug
2000), 1–4.
[4] Bre W. Bader, Tamara G. Kolda, et al. 2017. MATLAB Tensor Toolbox
(Version 3.0-dev). Available online. hps://www.tensortoolbox.org
[5] M. Baskaran, B. Meister, N. Vasilache, and R. Lethin. 2012. Ecient
and scalable computations with sparse tensors. In High Performance
Extreme Computing (HPEC), 2012 IEEE Conference on. 1–6. hps:
//doi.org/10.1109/HPEC.2012.6408676
[6] Sco Beamer, Krste Asanovic´, and David Paerson. 2015. e GAP
benchmark suite. arXiv preprint arXiv:1508.03619 (2015).
[7] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. 2008.
e PARSEC benchmark suite: Characterization and architectural
implications. In Proceedings of the 17th international conference on
Parallel architectures and compilation techniques. ACM, 72–81.
[8] Laura Susan Blackford, J. Choi, A. Cleary, A. Petitet, R. C. Whaley, J.
Demmel, I. Dhillon, K. Stanley, J. Dongarra, S. Hammarling, G. Henry,
and D. Walker. 1996. ScaLAPACK: A Portable Linear Algebra Library
for Distributed Memory Computers - Design Issues and Performance.
In Proceedings of the 1996 ACM/IEEE Conference on Supercomputing
(Supercomputing ’96). IEEE Computer Society, Washington, DC, USA,
Article 5. hps://doi.org/10.1145/369028.369038
[9] J. Douglas Carroll and Jih-Jie Chang. 1970. Analysis of individual
dierences in multidimensional scaling via an n-way generalization
of “Eckart-Young” decomposition. Psychometrika 35, 3 (01 Sep 1970),
283–319. hps://doi.org/10.1007/BF02310791
[10] J. D. Carroll, S. Pruzansky, and J. B. Kruskal. 1980. CANDELINC: A
general approach to multidimensional analysis of many-way arrays
with linear constraints on parameters. Psychometrika 45 (1980), 3–24.
[11] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaer, S. Lee, and K.
Skadron. 2009. Rodinia: A benchmark suite for heterogeneous com-
puting. In 2009 IEEE International Symposium on Workload Characteri-
zation (IISWC). 44–54. hps://doi.org/10.1109/IISWC.2009.5306797
[12] Stephen Chou, Fredrik Kjolstad, and Saman Amarasinghe. 2018. For-
mat Abstraction for Sparse Tensor Algebra Compilers. Proc. ACM
Program. Lang. 2, OOPSLA, Article 123 (Oct. 2018), 30 pages. hps:
//doi.org/10.1145/3276493
[13] Andrzej Cichocki. 2014. Era of Big Data Processing: A New Approach
via Tensor Networks and Tensor Decompositions. CoRR abs/1403.2048
(2014).
[14] A. Cichocki, N. Lee, I. V. Oseledets, A. Phan, Q. Zhao, and D. Mandic.
2016. Low-Rank Tensor Networks for Dimensionality Reduction
and Large-Scale Optimization Problems: Perspectives and Challenges
PART 1. ArXiv e-prints (Sept. 2016). arXiv:cs.NA/1609.00893
[15] Phil Colella. 2004. Dening Soware Requirements for Scientic
Computing. (01 2004).
[16] Lieven De Lathauwer, Nico Vervliet, Martijn Bouss, and Oo Debals.
2017. Dealing with curse and blessing of dimensionality through
tensor decompositions.
Conference’17, July 2017, Washington, DC, USA Li et al.
[17] Edoardo Di Napoli, Diego Fabregat-Traver, Gregorio intana-Ort,
and Paolo Bientinesi. 2014. Towards an ecient use of the BLAS
library for multilinear tensor contractions. Appl. Math. Comput. 235
(2014), 454 – 468. hps://doi.org/10.1016/j.amc.2014.02.051
[18] Kaivalya M Dixit. 1991. e SPEC benchmarks. Parallel computing 17,
10-11 (1991), 1195–1209.
[19] Jack Dongarra, Victor Eijkhout, and Henk van der Vorst. 2001.
SparseBench: A sparse iterative benchmark.
[20] Jack J Dongarra, Piotr Luszczek, and Antoine Petitet. 2003. e LIN-
PACK benchmark: past, present and future. Concurrency and Compu-
tation: practice and experience 15, 9 (2003), 803–820.
[21] Timothy Dysart, Peter Kogge, Martin Denero, Eric Bovell, Preston
Briggs, Jay Brockman, Kenneth Jacobsen, Yujen Juan, Shannon Kuntz,
Richard Lethin, Janice McMahon, Chandra Pawar, Martin Perrigo,
Sarah Rucker, John Ruenberg, Max Ruenberg, and Steve Stein. 2016.
Highly Scalable Near Memory Processing with Migrating reads on
the Emu System Architecture. In Proceedings of the Sixth Workshop
on Irregular Applications: Architectures and Algorithms (IA3ˆ ’16). IEEE
Press, Piscataway, NJ, USA, 2–9. hps://doi.org/10.1109/IA3.2016.7
[22] Hadi Fanaee-T and Joao Gama. 2016. SimTensor: A synthetic tensor
data generator. arXiv preprint arXiv:1612.03772 (2016).
[23] GeekBench. 4. Primate Labs. hp://hp://www.geekbench.com/
[24] Eric Hein, Tom Conte, Jerey S. Young, Srinivas Eswar, Jiajia Li, Patrick
Lavin, Richard Vuduc, and Jason Riedy. 2018. An Initial Characteriza-
tion of the Emu Chick. 2018 IEEE International Parallel and Distributed
Processing Symposium Workshops (May 2018), 10.
[25] Aravind Sukumaran-Rajam Richard Vuduc P. Sadayappan Israt Nisa,
Jiajia Li. 2019. Load-balanced sparse MTTKRP on GPUs. (To be
appeared).
[26] Inah Jeon, Evangelos E. Papalexakis, U Kang, and Christos Faloutsos.
2015. HaTen2: Billion-scale Tensor Decompositions. In IEEE Interna-
tional Conference on Data Engineering (ICDE).
[27] Inah Jeon, Evangelos E. Papalexakis, and Christos Faloutsos U Kang.
2015. HaTen2: Billion-scale Tensor Decompositions (Version 1.0).
Available from hp://datalab.snu.ac.kr/haten2/.
[28] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 2012. 3D convolutional
neural networks for human action recognition. IEEE transactions on
paern analysis and machine intelligence 35, 1 (2012), 221–231.
[29] O. Kaya and B. Uar. 2018. Parallel Candecomp/Parafac Decompo-
sition of Sparse Tensors Using Dimension Trees. SIAM Journal on
Scientic Computing 40, 1 (2018), C99–C130. hps://doi.org/10.1137/
16M1102744 arXiv:hps://doi.org/10.1137/16M1102744
[30] J Kepner, DP Koester, et al. 2005. HPCS SSCA# 2 Graph Analysis
Benchmark Specications v1. 0.
[31] Fredrik Kjolstad, Shoaib Kamil, Stephen Chou, David Lugato, and
Saman Amarasinghe. 2017. e Tensor Algebra Compiler. Proc. ACM
Program. Lang. 1, OOPSLA, Article 77 (Oct. 2017), 29 pages. hps:
//doi.org/10.1145/3133901
[32] T. Kolda and B. Bader. 2009. Tensor Decompositions and Applications.
SIAM Rev. 51, 3 (2009), 455–500. hps://doi.org/10.1137/07070111X
arXiv:hp://dx.doi.org/10.1137/07070111X
[33] Vadim Lebedev, Yaroslav Ganin, Maksim Rakhuba, Ivan Oseledets, and
Victor Lempitsky. 2014. Speeding-up Convolutional Neural Networks
Using Fine-tuned CP-Decomposition. arXiv preprint arXiv:1412.6553
(2014).
[34] Jure Leskovec, Deepayan Chakrabarti, Jon Kleinberg, Christos Falout-
sos, and Zoubin Ghahramani. 2010. Kronecker graphs: An approach
to modeling networks. Journal of Machine Learning Research 11, Feb
(2010), 985–1042.
[35] Ang Li, Shuaiwen Leon Song, Jieyang Chen, Xu Liu, Nathan Tallent,
and Kevin Barker. 2018. Tartan: Evaluating Modern GPU Intercon-
nect via a Multi-GPU Benchmark Suite. In 2018 IEEE International
Symposium on Workload Characterization (IISWC). IEEE, 191–202.
[36] Jiajia Li. 2018. Scalable tensor decompositions in high performance
computing environments. Ph.D. Dissertation. Georgia Institute of Tech-
nology, Atlanta, GA, USA.
[37] Jiajia Li, Casey Baaglino, Ioakeim Perros, Jimeng Sun, and Richard
Vuduc. 2015. An input-adaptive and in-place approach to dense tensor-
times-matrix multiply. In ACM/IEEE Supercomputing (SC ’15). ACM,
New York, NY, USA.
[38] J. Li, J. Choi, I. Perros, J. Sun, and R. Vuduc. 2017. Model-Driven
Sparse CP Decomposition for Higher-Order Tensors. In 2017 IEEE
International Parallel and Distributed Processing Symposium (IPDPS).
1048–1057. hps://doi.org/10.1109/IPDPS.2017.80
[39] Jiajia Li, Yuchen Ma, and Richard Vuduc. 2018. ParTI! : A Parallel
Tensor Infrastructure for multicore CPUs and GPUs (Version 1.0.0).
hps://github.com/hpcgarage/ParTI
[40] Jiajia Li, Yuchen Ma, Xiaolong Wu, Ang Li, and Kevin Barker. 2019.
PASTA: A Parallel Sparse Tensor Algorithm Benchmark Suite. arXiv
preprint arXiv:1902.03317 (2019).
[41] Jiajia Li, Yuchen Ma, Chenggang Yan, and Richard Vuduc. 2016. Op-
timizing Sparse Tensor Times Matrix on Multi-core and Many-core
Architectures. In Proceedings of the Sixth Workshop on Irregular Appli-
cations: Architectures and Algorithms (IAˆ3 ’16). IEEE Press, Piscataway,
NJ, USA, 26–33. hps://doi.org/10.1109/IA3.2016.10
[42] Jiajia Li, Jimeng Sun, and Richard Vuduc. 2018. HiCOO: Hierarchical
storage of sparse tensors. In Proceedings of the ACM/IEEE International
Conference on High Performance Computing, Networking, Storage and
Analysis (SC). Dallas, TX, USA.
[43] Jiajia Li, Guangming Tan, Mingyu Chen, and Ninghui Sun. 2013.
SMAT: An Input Adaptive Auto-tuner for Sparse Matrix-vector Multi-
plication. In Proceedings of the 34th ACM SIGPLAN Conference on Pro-
gramming Language Design and Implementation (PLDI ’13). ACM, New
York, NY, USA, 117–126. hps://doi.org/10.1145/2491956.2462181
[44] Jiajia Li, Bora Uc¸ar, U¨mit V. C¸atalyu¨rek, Jimeng Sun, Kevin Barker, and
Richard Vuduc. 2019. Ecient and Eective Sparse Tensor Reordering.
In Proceedings of the ACM International Conference on Supercomputing
(ICS ’19). ACM, New York, NY, USA, 227–237. hps://doi.org/10.1145/
3330345.3330366
[45] B. Liu, C. Wen, A. D. Sarwate, and M. M. Dehnavi. 2017. A Unied
Optimization Approach for Sparse Tensor Operations on GPUs. In
2017 IEEE International Conference on Cluster Computing (CLUSTER).
47–57. hps://doi.org/10.1109/CLUSTER.2017.75
[46] Yu Jung Lo, Samuel Williams, Brian Van Straalen, Terry J. Ligocki,
Mahew J. Cordery, Nicholas J. Wright, Mary W. Hall, and Leonid
Oliker. 2015. Rooine Model Toolkit: A Practical Tool for Architectural
and Program Analysis. In High Performance Computing Systems. Per-
formance Modeling, Benchmarking, and Simulation, Stephen A. Jarvis,
Steven A. Wright, and Simon D. Hammond (Eds.). Springer Interna-
tional Publishing, Cham, 129–148.
[47] Yuchen Ma, Jiajia Li, Xiaolong Wu, Chenggang Yan, Jimeng Sun, and
Richard Vuduc. 2018. Optimizing sparse tensor times matrix on GPUs.
J. Parallel and Distrib. Comput. (2018). hps://doi.org/10.1016/j.jpdc.
2018.07.018
[48] Devin Mahews. 2016. High-Performance Tensor Contraction without
BLAS. CoRR abs/1607.00291 (2016). arXiv:1607.00291 hp://arxiv.org/
abs/1607.00291
[49] John D. McCalpin. 1991-2007. STREAM: Sustainable Memory Band-
width in High Performance Computers. (1991-2007). hp://www.
cs.virginia.edu/stream/ A continually updated technical report.
hp://www.cs.virginia.edu/stream/.
[50] Richard C Murphy, Kyle B Wheeler, Brian W Barre, and James A
Ang. 2010. Introducing the graph 500. Cray Users Group (CUG) 19
(2010), 45–74.
[51] Alexander Novikov, Dmitry Podoprikhin, Anton Osokin, and Dmitry
Vetrov. 2015. Tensorizing Neural Networks. CoRR abs/1509.06569
(2015).
A Parallel Sparse Tensor Benchmark Suite on CPUs and GPUs Conference’17, July 2017, Washington, DC, USA
[52] Georg Ofenbeck, Ruedi Steinmann, Victoria Caparrs Cabezas,
Daniele G. Spampinato, and Markus Pu¨schel. 2014. Applying the
Rooine Model. In IEEE International Symposium on Performance Anal-
ysis of Systems and Soware (ISPASS). 76 – 85.
[53] Ioakeim Perros, Evangelos E. Papalexakis, Fei Wang, Richard Vuduc,
Elizabeth Searles, Michael ompson, and Jimeng Sun. 2017. SPARTan:
Scalable PARAFAC2 for Large & Sparse Data. In Proceedings of the
23rd ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining (KDD ’17). ACM, New York, NY, USA, 375–384.
hps://doi.org/10.1145/3097983.3098014
[54] Naser Sedaghati, Te Mu, Louis-Noel Pouchet, Srinivasan
Parthasarathy, and P. Sadayappan. 2015. Automatic Selection
of Sparse Matrix Representation on GPUs. In Proceedings of the 29th
ACM on International Conference on Supercomputing (ICS ’15). ACM,
New York, NY, USA, 99–108. hps://doi.org/10.1145/2751205.2751244
[55] N. D. Sidiropoulos, L. De Lathauwer, X. Fu, K. Huang, E. E. Papalexakis,
and C. Faloutsos. 2017. Tensor Decomposition for Signal Processing
and Machine Learning. IEEE Transactions on Signal Processing 65, 13
(July 2017), 3551–3582. hps://doi.org/10.1109/TSP.2017.2690524
[56] Shaden Smith, Jee W. Choi, Jiajia Li, Richard Vuduc, Jongsoo Park,
Xing Liu, and George Karypis. 2017. FROSTT: e Formidable Reposi-
tory of Open Sparse Tensors and Tools. hp://fros.io/
[57] Shaden Smith, Niranjay Ravindran, Nicholas Sidiropoulos, and George
Karypis. 2015. SPLATT: Ecient and Parallel Sparse Tensor-Matrix
Multiplication. In Proceedings of the 29th IEEE International Parallel &
Distributed Processing Symposium (IPDPS).
[58] Zack Smith. 2008. Bandwidth: a memory bandwidth benchmark.
[59] A Snavely, L Carrington, N Wolter, J Labarta, R Badia, and A
Purkayastha. 2002. A framework for performance modeling and pre-
diction. In Supercomputing 2002. IEEE, 21–21.
[60] Paul Springer, Tong Su, and Paolo Bientinesi. 2017. HPTT: A High-
performance Tensor Transposition C++ Library. In Proceedings of the
4th ACM SIGPLAN International Workshop on Libraries, Languages,
and Compilers for Array Programming (ARRAY 2017). ACM, New York,
NY, USA, 56–62. hps://doi.org/10.1145/3091966.3091968
[61] L. R. Tucker. 1966. Some mathematical notes on three-mode factor
analysis. Psychometrika 31 (1966), 279–311.
[62] N. Vervliet, O. Debals, L. Sorber, M. Van Barel, and L. De Lathauwer.
2016. Tensorlab (Version 3.0). Available from hp://www.tensorlab.
net.
[63] Samuel Williams, Andrew Waterman, and David Paerson. 2009.
Rooine: An Insightful Visual Performance Model for Multicore
Architectures. Commun. ACM 52, 4 (April 2009), 65–76. hps:
//doi.org/10.1145/1498765.1498785
[64] Samuel Webb Williams. 2008. Auto-tuning Performance on Multicore
Computers. Ph.D. Dissertation. EECS Department, University of Cali-
fornia, Berkeley. hp://www.eecs.berkeley.edu/Pubs/TechRpts/2008/
EECS-2008-164.html
[65] Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal
Singh, and Anoop Gupta. 1995. e SPLASH-2 programs: Characteri-
zation and methodological considerations. ACM SIGARCH computer
architecture news 23, 2 (1995), 24–36.
[66] Rose Yu, Stephan Zheng, Anima Anandkumar, and Yisong Yue. 2018.
Long-term Forecasting using Tensor-Train RNNs. hps://openreview.
net/forum?id=HJJ0w--0W
[67] Xiuxia Zhang, Guangming Tan, Shuangbai Xue, Jiajia Li, Keren Zhou,
and Mingyu Chen. 2017. Understanding the GPU Microarchitecture
to Achieve Bare-Metal Performance Tuning. In Proceedings of the
22Nd ACM SIGPLAN Symposium on Principles and Practice of Parallel
Programming (PPoPP ’17). ACM, New York, NY, USA, 31–43. hps:
//doi.org/10.1145/3018743.3018755
