Higher-order Count Sketch: Dimensionality Reduction That Retains Efficient Tensor Operations
Sketching is a randomized dimensionality-reduction method that aims to
preserve relevant information in large-scale datasets. Count sketch is a simple,
popular sketch that uses a randomized hash function to achieve compression. In
this paper, we propose a novel extension known as Higher-order Count Sketch
(HCS). While count sketch uses a single hash function, HCS uses multiple
(smaller) hash functions for sketching. HCS reshapes the input (vector) data
into a higher-order tensor and employs a tensor product of the random hash
functions to compute the sketch. This results in an exponential saving (with
respect to the order of the tensor) in the memory requirements of the hash
functions, under certain conditions on the input data. Furthermore, when the
input data itself has an underlying structure in the form of various tensor
representations such as the Tucker decomposition, we obtain significant
advantages. We derive efficient (approximate) computation of various tensor
operations such as tensor products and tensor contractions directly on the
sketched data. Thus, HCS is the first sketch to fully exploit the
multi-dimensional nature of higher-order tensors. We apply HCS to tensorized
neural networks where we replace fully connected layers with sketched tensor
operations. We achieve nearly state-of-the-art accuracy with significant
compression on an image classification benchmark.
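To make the idea concrete, here is a minimal NumPy sketch (not the paper's implementation) contrasting classic count sketch with a two-mode higher-order count sketch: the input vector is reshaped into a matrix, each mode gets its own small hash and sign function, and the sketch is built from their product. Function names (`count_sketch`, `hcs`, `hcs_estimate`) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def count_sketch(x, m):
    """Classic count sketch: one hash h:[n]->[m] and one sign s:[n]->{-1,+1}.
    Storing h and s costs O(n)."""
    n = x.shape[0]
    h = rng.integers(0, m, size=n)
    s = rng.choice([-1, 1], size=n)
    sk = np.zeros(m)
    np.add.at(sk, h, s * x)          # sk[h[i]] += s[i] * x[i], with repeats
    return sk, h, s

def hcs(x, shape, m1, m2):
    """Higher-order count sketch (order 2): reshape x into a shape[0] x shape[1]
    matrix and sketch each mode with its own smaller hash/sign pair.
    Storing the hashes costs O(shape[0] + shape[1]) instead of O(n)."""
    X = x.reshape(shape)
    h1 = rng.integers(0, m1, size=shape[0])
    h2 = rng.integers(0, m2, size=shape[1])
    s1 = rng.choice([-1, 1], size=shape[0])
    s2 = rng.choice([-1, 1], size=shape[1])
    S = np.zeros((m1, m2))
    for i in range(shape[0]):
        for j in range(shape[1]):
            S[h1[i], h2[j]] += s1[i] * s2[j] * X[i, j]
    return S, (h1, h2), (s1, s2)

def hcs_estimate(S, hashes, signs):
    """Entrywise estimate: X_hat[i, j] = s1[i] * s2[j] * S[h1[i], h2[j]]."""
    (h1, h2), (s1, s2) = hashes, signs
    return (s1[:, None] * s2[None, :]) * S[np.ix_(h1, h2)]
```

For a 1-sparse input the estimate is exact (no colliding mass), which makes the unbiased-recovery property easy to check by hand; the memory saving in the hash functions grows exponentially with the tensor order, as the abstract notes.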
Optimizing sparse tensor times matrix on GPUs
© 2018 Elsevier Inc. This work optimizes tensor-times-dense matrix multiply (Ttm) for general sparse and semi-sparse tensors on CPU and NVIDIA GPU platforms. Ttm is a computational kernel in tensor-methods-based data analytics and data mining applications, such as the popular Tucker decomposition. We first design an in-place sequential SpTtm that avoids the explicit data reorganization between a tensor and a matrix required by the conventional approach. We further optimize SpTtm on NVIDIA GPU platforms: five approaches, including employing fine thread granularity, arranging coalesced memory access, rank blocking, and using fast GPU shared memory, are developed for GPU-SpTtm. We also optimize semi-sparse tensor-times-dense matrix multiply (SspTtm) to take advantage of the dense sub-structures inside the tensor. The optimized SpTtm and SspTtm are applied to Tucker decomposition to improve its overall performance. Our sequential SpTtm is 3–120× faster than the SpTtm from the Tensor Toolbox library. GPU-SpTtm obtains 6–19× speedup on NVIDIA K40c and 23–67× speedup on NVIDIA P100 over CPU-SpTtm, respectively. Our GPU-SpTtm is 3.9× faster than the state-of-the-art GPU implementation. Our SspTtm implementations, which handle the input semi-sparse tensor in a general way, outperform SpTtm by up to 4.5×. Tucker decomposition achieves up to 3.2× speedup after applying the optimized Ttms. The code will be publicly released in the ParTI! library: https://github.com/hpcgarage/ParTI
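As a rough illustration of what a sparse Ttm kernel computes (a reference sketch in NumPy, not the paper's optimized CPU/GPU implementation), the following iterates over COO-format nonzeros and accumulates each into all rank slots of the output along the contracted mode. The function name `sp_ttm` and the COO layout are assumptions for this example.

```python
import numpy as np

def sp_ttm(indices, values, dims, U, mode):
    """Sparse tensor-times-dense-matrix along `mode`.

    indices: (nnz, order) int array of COO coordinates
    values:  (nnz,) nonzero values
    dims:    shape of the sparse tensor
    U:       dense (dims[mode], R) matrix
    Returns a dense tensor with dims[mode] replaced by R:
        Y[..., r, ...] = sum_k X[..., k, ...] * U[k, r]
    """
    R = U.shape[1]
    out_shape = list(dims)
    out_shape[mode] = R
    Y = np.zeros(out_shape)
    for idx, v in zip(indices, values):
        out_idx = list(idx)
        for r in range(R):            # each nonzero contributes to R outputs
            out_idx[mode] = r
            Y[tuple(out_idx)] += v * U[idx[mode], r]
    return Y
```

Note that the output is dense along the contracted mode even when the input is sparse, which is the "semi-sparse" structure the abstract's SspTtm kernel exploits when Ttm results are fed into later Tucker-decomposition steps.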