Tensor computations present significant performance challenges that impact a wide spectrum of applications. Efforts to improve the performance of tensor computations include exploring data layout, execution scheduling, and parallelism in common tensor kernels. This work presents a benchmark suite for arbitrary-order sparse tensor kernels using state-of-the-art tensor formats: coordinate (COO) and hierarchical coordinate (HiCOO). It provides a set of reference tensor kernel implementations and reports observations on Intel CPUs and NVIDIA GPUs. The full paper is available at http://arxiv.org/abs/2001.00660.
Introduction
Tensors (as multi-dimensional arrays), especially sparse tensors, are utilized by a large number of critical applications spanning a range of domains, including quantum chemistry, healthcare analytics, social network analysis, data mining, signal processing, machine learning, and more. Operations on sparse tensors tend to dominate the execution time of these applications, so understanding the performance characteristics of different implementation approaches is of paramount importance. This paper presents a benchmark suite specifically for that purpose. The suite provides implementations of common tensor kernels using state-of-the-art sparse tensor data structures and a variety of real sparse tensors as its input dataset.
Given the heterogeneity in available hardware resources for high performance computing (HPC), it is non-trivial to answer questions about the potential for sparse tensor algorithms to be efficiently ported to various hardware. The difficulty of planning for the irregular parallelism that results from operating on sparse data structures is compounded by the availability of Graphics Processing Units (GPUs), vectorizing units, Field Programmable Gate Arrays (FPGAs), and potentially Tensor Processing Units (TPUs).
Optimizing the performance of tensor applications is challenging due to several application characteristics identified in [1, 3, 4]: the curse of dimensionality, mode orientation, tensor transformation, irregularity, and arbitrary tensor orders (or dimensions). Beyond these, challenges common to all benchmarks also apply, including completeness, diversity, extensibility, reproducibility, and comparability across implementations. Comparisons across research groups and optimizations will be improved by using a standard set of kernels and inputs.
Contributions
Our benchmark suite consists of a set of reference implementations drawn from various tensor applications, each of which exhibits different computational behavior. We implement two sparse tensor formats to represent general, arbitrary sparse tensors: the popular, mode-generic coordinate (COO) format and the more recently proposed, more compressed hierarchical coordinate (HiCOO) format [5]. Beyond implementation diversity, platform and workload (or input) diversity is also critical for gaining insights from a benchmark suite. We implement the same set of tensor kernels on CPUs and GPUs to give users a good understanding of both platforms. Different inputs to the same algorithm usually yield different performance because of their diverse sizes and patterns; this effect is more pronounced for sparse problems, whose behavior depends heavily on the features of the data. In addition, our benchmark suite can easily incorporate new sparse tensor kernels and data representations.
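As a concrete illustration of the two formats, the following is a minimal C sketch of COO storage for a third-order sparse tensor, with HiCOO's blocked layout summarized in a comment; the type and field names are hypothetical and do not reflect the suite's actual API.

    #include <stdint.h>
    #include <stdlib.h>

    /* Hypothetical COO storage for an order-3 sparse tensor: one index array
     * per mode plus a value array, with one entry per stored nonzero. */
    typedef struct {
        uint64_t nnz;          /* number of stored nonzeros       */
        uint32_t *i, *j, *k;   /* mode-1, mode-2, mode-3 indices  */
        float    *val;         /* nonzero values                  */
    } coo_tensor3;

    /* HiCOO, by contrast, groups nonzeros into small index blocks: each block
     * keeps one compact block index, and each nonzero keeps only short offsets
     * within its block, shrinking index storage and improving locality. */

    static coo_tensor3 coo_alloc(uint64_t nnz) {
        coo_tensor3 t;
        t.nnz = nnz;
        t.i   = malloc(nnz * sizeof *t.i);
        t.j   = malloc(nnz * sizeof *t.j);
        t.k   = malloc(nnz * sizeof *t.k);
        t.val = malloc(nnz * sizeof *t.val);
        return t;
    }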
The contributions of this work include:
• reference implementations of five tensor kernels [1, 3, 4]: tensor element-wise (Tew), tensor-scalar (Ts), tensor-times-vector (Ttv), tensor-times-matrix (Ttm), and matricized tensor times Khatri-Rao product (Mttkrp), in COO and HiCOO formats on CPUs and GPUs (a sequential COO Mttkrp sketch follows this list);
• Roofline performance models for one multicore Intel CPU platform and one NVIDIA GPU platform to analyze the tensor kernels; and
• insights gained from experiments and analysis of the performance.
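To make the computational pattern of the most studied kernel concrete, below is a minimal sequential sketch of a mode-1 Mttkrp over the COO structure sketched earlier; it assumes row-major factor matrices with R columns and is not the suite's optimized implementation.

    /* Mode-1 MTTKRP over COO (sequential sketch):
     * for every nonzero x = (i, j, k, val),
     *   M(i, :) += val * ( B(j, :) .* C(k, :) )
     * B, C, and M are row-major with R columns. */
    static void mttkrp_mode1_coo(const coo_tensor3 *X,
                                 const float *B, const float *C,
                                 float *M, uint64_t R)
    {
        for (uint64_t x = 0; x < X->nnz; ++x) {
            const float v     = X->val[x];
            const float *brow = B + (uint64_t)X->j[x] * R;
            const float *crow = C + (uint64_t)X->k[x] * R;
            float       *mrow = M + (uint64_t)X->i[x] * R;
            for (uint64_t r = 0; r < R; ++r)
                mrow[r] += v * brow[r] * crow[r];  /* fused Khatri-Rao product */
        }
    }

A parallel version must additionally handle conflicting updates to rows of M (e.g., via privatization or atomics), which is one source of the load-imbalance effects discussed in the observations below.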
Experimental Results
We perform experiments on an Intel Xeon Gold 6126 multicore server platform with 56 physical 2.6 GHz cores distributed over four sockets and an NVIDIA Tesla V100 GPU with 5120 1.53 GHz cores. Figure 1 plots the Roofline models for the two platforms, with DRAM and last-level cache (LLC) bandwidths measured by the Empirical Roofline Tool (ERT) [6], and the theoretical peak single-precision performance and DRAM bandwidth (not cache-aware) shown for reference. The dataset, described in Table 1, uses sparse tensors derived from real-world applications [2, 7].
Observation 1: Achieved performance is diverse and hard to predict; it varies with the dimension sizes and nonzero patterns of tensors, the platform, and the data format.
Observation 2: Performance is generally below the Roofline bound computed from main/global memory bandwidth, except for some small tensors that fit into caches or algorithms with good data locality that make good use of caches.
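For reference, the memory-bound ceiling these observations compare against is the standard Roofline bound, in which a kernel's attainable performance is limited by its arithmetic intensity I (flops per byte moved) and the measured memory bandwidth:

\[
  P_{\mathrm{attainable}} \;=\; \min\!\left(P_{\mathrm{peak}},\; I \cdot B_{\mathrm{mem}}\right),
  \qquad I = \frac{\text{flops}}{\text{bytes moved}}
\]

Because the COO and HiCOO kernels perform only a few flops per nonzero while streaming large index and value arrays, their intensity I is low and the bandwidth term usually dominates.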
Observation 3: It is hard to obtain good performance efficiency for non-streaming kernels on multi-socket CPU machines because of NUMA effects; this can be even harder than on GPUs.
Observation 4: HiCOO algorithms are faster than or comparable to their COO counterparts because of HiCOO's better data locality and smaller memory footprint, except for Mttkrp on GPUs, where load imbalance and lower parallelism play more important roles.
Conclusion
This paper presents a benchmark suite targeting sparse tensor kernels, which are memory-bound and often dominate application performance. It identifies important kernels and data representations and provides reference implementations to aid the community in effectively sharing and comparing performance and optimization results. This benchmark suite is a continuing effort: more operations, complete tensor methods, data representations, and platforms will be included.
