A μ-mode BLAS approach for multidimensional tensor-structured problems
In this manuscript, we present a common tensor framework which can be used to generalize one-dimensional numerical tasks to arbitrary dimension d by means of tensor product formulas. This is useful, for example, in the context of multivariate interpolation, multidimensional function approximation using pseudospectral expansions, and the solution of stiff differential equations on tensor product domains. The key to an efficient-to-implement BLAS formulation is the suitable use of the μ-mode product (also known as the tensor-matrix product or mode-n product) and related operations, such as the Tucker operator. Their MATLAB/GNU Octave implementations are discussed in the paper and collected in the package KronPACK. We present numerical results on experiments up to dimension six from different fields of numerical analysis, which show the effectiveness of the approach.
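As a rough illustration (not the KronPACK code, which is MATLAB/GNU Octave), the μ-mode product and the Tucker operator can be sketched in a few lines of NumPy; the function names here are hypothetical:

```python
import numpy as np

def mu_mode_product(T, M, mu):
    """mu-mode (mode-mu) product: contract matrix M with axis mu of tensor T.
    The result has T's shape with the size along axis mu replaced by M.shape[0]."""
    return np.moveaxis(np.tensordot(M, T, axes=(1, mu)), 0, mu)

def tucker_operator(C, matrices):
    """Tucker operator: apply one matrix per mode of the core tensor C."""
    T = C
    for mu, M in enumerate(matrices):
        T = mu_mode_product(T, M, mu)
    return T

# 3-D example: a (2, 3, 4) core mapped to a (5, 6, 7) tensor
rng = np.random.default_rng(0)
C = rng.standard_normal((2, 3, 4))
A = rng.standard_normal((5, 2))
B = rng.standard_normal((6, 3))
D = rng.standard_normal((7, 4))
T = tucker_operator(C, [A, B, D])
print(T.shape)  # (5, 6, 7)
```

Because each μ-mode product is just a matrix product against an unfolding of the tensor, the whole Tucker operator reduces to d GEMM calls, which is the BLAS formulation the paper exploits.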
TDC: Towards Extremely Efficient CNNs on GPUs via Hardware-Aware Tucker Decomposition
Tucker decomposition is one of the SOTA CNN model compression techniques.
However, unlike the FLOPs reduction, we observe very limited inference time
reduction with Tucker-compressed models using existing GPU software such as
cuDNN. To this end, we propose an efficient end-to-end framework that can
generate highly accurate and compact CNN models via Tucker decomposition and
optimized inference code on GPUs. Specifically, we propose an ADMM-based
training algorithm that can achieve highly accurate Tucker-format models. We
also develop a high-performance kernel for Tucker-format convolutions and
analytical performance models to guide the selection of execution parameters.
We further propose a co-design framework to determine the proper Tucker ranks
driven by practical inference time (rather than FLOPs). Our evaluation on five
modern CNNs with A100 demonstrates that our compressed models with our
optimized code achieve up to 3.14X speedup over cuDNN, 1.45X speedup over TVM,
and 4.57X over the original models using cuDNN with up to 0.05% accuracy loss.Comment: 12 pages, 8 figures, 3 tables, accepted by PPoPP '2
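The Tucker-format convolution targeted here can be illustrated with a minimal NumPy sketch of a Tucker-2 (channel-mode) factorization of a convolution kernel — a truncated HOSVD along the two channel modes only. This is only the decomposition step under arbitrary example ranks; the paper's ADMM training and optimized GPU kernels are not reproduced:

```python
import numpy as np

def tucker2(W, r_out, r_in):
    """Tucker-2 factorization of a conv kernel W of shape (C_out, C_in, K, K):
    compress the two channel modes, leave the spatial modes intact."""
    C_out, C_in = W.shape[:2]
    # leading left singular vectors of the mode-0 and mode-1 unfoldings
    U_out = np.linalg.svd(W.reshape(C_out, -1), full_matrices=False)[0][:, :r_out]
    U_in = np.linalg.svd(np.moveaxis(W, 1, 0).reshape(C_in, -1),
                         full_matrices=False)[0][:, :r_in]
    core = np.einsum('ao,bi,abkl->oikl', U_out, U_in, W)  # (r_out, r_in, K, K)
    return U_out, core, U_in

def reconstruct(U_out, core, U_in):
    """Rebuild the (approximate) kernel from the Tucker-2 factors."""
    return np.einsum('ao,bi,oikl->abkl', U_out, U_in, core)

rng = np.random.default_rng(1)
W = rng.standard_normal((64, 32, 3, 3))
U_out, core, U_in = tucker2(W, 16, 8)
# The compressed model replaces one KxK conv with a 1x1 conv, a small KxK
# conv, and another 1x1 conv; multiply-adds per output pixel:
flops_orig = 64 * 32 * 9
flops_tucker = 32 * 8 + 16 * 8 * 9 + 64 * 16
print(flops_tucker / flops_orig)  # ~0.13: the FLOPs reduction the paper starts from
```

The abstract's point is that this FLOPs reduction does not automatically translate into wall-clock speedup; that gap is what the hand-written Tucker-convolution kernels and the rank co-design address.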
Tensor Contractions with Extended BLAS Kernels on CPU and GPU
Tensor contractions constitute a key computational ingredient of numerical multi-linear algebra. However, as the order and dimension of tensors grow, the time and space complexities of tensor-based computations grow quickly. In this paper, we propose and evaluate new BLAS-like primitives that are capable of performing a wide range of tensor contractions on CPU and GPU efficiently. We begin by focusing on single-index contractions involving all the possible configurations of second-order and third-order tensors. Then, we discuss extensions to more general cases. Existing approaches for tensor contractions spend large amounts of time restructuring the data, which typically involves explicit copy and transpose operations. In this work, we summarize existing approaches and present library-based approaches that avoid memory movement. Through systematic benchmarking, we demonstrate that our approach can achieve 10x speedup on a K40c GPU and 2x speedup on dual-socket Haswell-EP CPUs, using CUBLAS and MKL respectively, for small and moderate tensor sizes. This is relevant in many machine learning applications such as deep learning, where tensor sizes tend to be small but require numerous tensor contraction operations to be performed successively. Concretely, we implement a Tucker decomposition and show that using our kernels yields at least an order of magnitude speedup compared to state-of-the-art libraries.
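The "avoid restructuring" idea can be shown with a toy single-index contraction in NumPy (the paper's actual primitives are BLAS-level, e.g. strided batched GEMM; this sketch only demonstrates the principle):

```python
import numpy as np

# Single-index contraction C[m,n,p] = sum_k A[m,k] * B[k,n,p].
# Because k is the leading axis of B, B can be *viewed* as a (k, n*p) matrix
# with no copy or transpose, and the contraction becomes a single GEMM.
rng = np.random.default_rng(2)
A = rng.standard_normal((4, 6))
B = rng.standard_normal((6, 5, 3))

m, k = A.shape
n, p = B.shape[1:]
C = (A @ B.reshape(k, n * p)).reshape(m, n, p)  # one GEMM, zero data movement

# matches the reference index-wise contraction
assert np.allclose(C, np.einsum('mk,knp->mnp', A, B))
```

For configurations where the contracted index is not adjacent in memory, a naive library would transpose first; the paper's extended kernels instead absorb the layout into GEMM stride parameters, which is where the reported speedups come from.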
Analyzing the Performance Portability of Tensor Decomposition
We employ pressure point analysis and roofline modeling to identify
performance bottlenecks and determine an upper bound on the performance of the
Canonical Polyadic Alternating Poisson Regression Multiplicative Update (CP-APR
MU) algorithm in the SparTen software library. Our analyses reveal that a
particular matrix computation is the critical performance
bottleneck in the SparTen CP-APR MU implementation. Moreover, we find that
atomic operations are not a critical bottleneck while higher cache reuse can
provide a non-trivial performance improvement. We also utilize grid search on
the Kokkos library parallel policy parameters to achieve 2.25x average speedup
over the SparTen default for computation on CPU and 1.70x on GPU.
We conclude our investigations by comparing Kokkos implementations of the
STREAM benchmark and the matricized tensor times Khatri-Rao product (MTTKRP)
benchmark from the Parallel Sparse Tensor Algorithm (PASTA) benchmark suite to
implementations using vendor libraries. We show that with a single
implementation Kokkos achieves performance comparable to hand-tuned code for
fundamental operations that make up tensor decomposition kernels on a wide
range of CPU and GPU systems. Overall, we conclude that Kokkos demonstrates
good performance portability for simple data-intensive operations but requires
tuning for algorithms with more complex dependencies and data access patterns.
Comment: 28 pages, 19 figures
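For reference, the MTTKRP operation benchmarked above can be sketched for a dense 3-D tensor in NumPy (the PASTA/SparTen codes operate on sparse tensors with Kokkos; this dense toy version only fixes the definition):

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Kronecker product of B (J, R) and C (K, R) -> (J*K, R)."""
    J, R = B.shape
    K, _ = C.shape
    return (B[:, None, :] * C[None, :, :]).reshape(J * K, R)

def mttkrp_mode0(X, B, C):
    """Mode-0 MTTKRP: the mode-0 unfolding of X times khatri_rao(B, C).
    M[i, r] = sum_{j,k} X[i,j,k] * B[j,r] * C[k,r]."""
    I, J, K = X.shape
    return X.reshape(I, J * K) @ khatri_rao(B, C)

rng = np.random.default_rng(3)
X = rng.standard_normal((4, 5, 6))
B = rng.standard_normal((5, 3))
C = rng.standard_normal((6, 3))
M = mttkrp_mode0(X, B, C)
```

In CP algorithms such as CP-APR, this kernel dominates the runtime, which is why both the Kokkos and vendor-library implementations above benchmark it directly.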
Exponential integrators: tensor structured problems and applications
The solution of stiff systems of Ordinary Differential Equations (ODEs), which typically arise after spatial discretization of many important evolutionary Partial Differential Equations (PDEs), constitutes a topic of wide interest in numerical analysis. A prominent way to numerically integrate such systems is to use exponential integrators. In general, these schemes do not require the solution of (non)linear systems but rather the action of the matrix exponential and of some specific exponential-like functions (known in the literature as phi-functions). In this PhD thesis we present efficient tensor-based tools to approximate such actions, both from a theoretical and from a practical point of view, when the problem has an underlying Kronecker sum structure. Moreover, we investigate the application of exponential integrators to compute numerical solutions of important equations in various fields, such as plasma physics, mean-field optimal control, and computational chemistry. Throughout, we provide several numerical examples and perform extensive simulations, exploiting, where appropriate, modern hardware architectures such as multi-core Central Processing Units (CPUs) and Graphics Processing Units (GPUs). Overall, the results show the effectiveness and superiority of the proposed approaches.
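The Kronecker sum structure mentioned above is what makes the exponential action tractable: for a 2-D problem, exp(A1 ⊗ I + I ⊗ A2) applied to vec(U) equals exp(A1) U exp(A2)^T, i.e. two small exponentials plus μ-mode products instead of one huge one. A NumPy check of this identity, assuming symmetric matrices so the exponential can be taken via eigendecomposition (production codes would use a general expm or Krylov methods):

```python
import numpy as np

def expm_sym(A):
    """Matrix exponential of a symmetric matrix via eigendecomposition."""
    w, Q = np.linalg.eigh(A)
    return (Q * np.exp(w)) @ Q.T

n1, n2 = 4, 5
rng = np.random.default_rng(4)
A1 = rng.standard_normal((n1, n1)); A1 = A1 + A1.T  # symmetric by construction
A2 = rng.standard_normal((n2, n2)); A2 = A2 + A2.T
U = rng.standard_normal((n1, n2))

# tensor-structured action: two small exponentials and two mu-mode products
V = expm_sym(A1) @ U @ expm_sym(A2).T

# dense reference: exponential of the full (n1*n2) x (n1*n2) Kronecker sum
M = np.kron(A1, np.eye(n2)) + np.kron(np.eye(n1), A2)
v_ref = expm_sym(M) @ U.reshape(-1)
assert np.allclose(V.reshape(-1), v_ref)
```

In d dimensions the same identity turns one exponential of an n^d × n^d matrix into d exponentials of n × n matrices, which is the source of the efficiency gains reported in the thesis.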
Remote Sensing Data Compression
A huge amount of data is acquired nowadays by different remote sensing systems installed on satellites, aircraft, and UAVs. The acquired data then have to be transferred to image processing centres, stored, and/or delivered to customers. In resource-restricted scenarios, data compression is strongly desired or necessary. A wide diversity of coding methods can be used, depending on the requirements and their priority. In addition, the types and properties of images differ a lot; thus, practical implementation aspects have to be taken into account. The Special Issue paper collection taken as the basis of this book touches on all of the aforementioned items to some degree, giving the reader an opportunity to learn about recent developments and research directions in the field of image compression. In particular, lossless and near-lossless compression of multi- and hyperspectral images remains topical, since such images constitute data arrays of extremely large size with rich information that can be retrieved from them for various applications. Another important aspect is the impact of lossy compression on image classification and segmentation, where a reasonable compromise between the characteristics of compression and the final tasks of data processing has to be achieved. The problems of data transmission from UAV-based acquisition platforms, as well as the use of FPGAs and neural networks, have become very important. Finally, attempts to apply compressive sensing approaches in remote sensing image processing with positive outcomes are observed. We hope that readers will find our book useful and interesting.
Efficient Matricization of n-D Array with CUDA and Its Evaluation
Scientific and engineering computing requires operating on huge amounts of data with a very high number of dimensions. The traditional multidimensional array is widely popular for implementing higher-dimensional data, but its performance diminishes as the number of dimensions increases. On the other hand, the traditional row-column view is easy to implement, imagine, and visualize. This paper details a representation scheme for higher-dimensional arrays with a row-column abstraction in a parallel environment. Odd dimensions contribute along the row direction and even dimensions along the column direction, which gives lower index-computation cost, higher data locality, and parallelism. Each 2-D block of size blockIdx.x × threadIdx.x is independent of the others. Theoretically, the scheme has no limitation on the number of dimensions, and the mapping algorithm is the same for any number of dimensions. Performance of the proposed matricization is measured with matrix-matrix addition, subtraction, and multiplication operations. Experimental results show promising performance improvement over the Traditional Multidimensional Array (TMA) and Extended Karnaugh Map Representation (EKMR). Thus the scheme can be used for implementing higher-dimensional arrays in both general-purpose and scientific computing on GPUs. © 2016 IEEE
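The odd-dimensions-to-rows, even-dimensions-to-columns mapping can be sketched in NumPy (the paper's scheme is a CUDA index computation; this host-side sketch with a hypothetical function name only shows the resulting 2-D layout):

```python
import numpy as np

def matricize_odd_even(T):
    """Flatten an n-D array to 2-D: odd-numbered dimensions (1st, 3rd, ...)
    index the rows, even-numbered dimensions (2nd, 4th, ...) the columns."""
    d = T.ndim
    rows = list(range(0, d, 2))  # 0-based axes 0, 2, 4, ... = 1st, 3rd, 5th dims
    cols = list(range(1, d, 2))  # 0-based axes 1, 3, 5, ... = 2nd, 4th, 6th dims
    P = np.transpose(T, rows + cols)
    n_rows = int(np.prod([T.shape[a] for a in rows]))
    return P.reshape(n_rows, -1)

rng = np.random.default_rng(5)
T = rng.standard_normal((2, 3, 4, 5))  # a 4-D array
M = matricize_odd_even(T)
print(M.shape)  # (2*4, 3*5) = (8, 15)
# element (i1, i2, i3, i4) of T lands at row i1*4 + i3, column i2*5 + i4
```

Because neighbouring indices of the interleaved dimensions land in the same 2-D tile, each blockIdx.x × threadIdx.x block of the matricized view can be processed independently, which is the parallelism the paper exploits.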