
    A μ-mode BLAS approach for multidimensional tensor-structured problems

    In this manuscript, we present a common tensor framework which can be used to generalize one-dimensional numerical tasks to arbitrary dimension d by means of tensor product formulas. This is useful, for example, in the context of multivariate interpolation, multidimensional function approximation using pseudospectral expansions, and the solution of stiff differential equations on tensor product domains. The key to an efficient-to-implement BLAS formulation is the suitable use of the μ-mode product (also known as the tensor-matrix product or mode-n product) and related operations, such as the Tucker operator. Their MathWorks MATLAB®/GNU Octave implementations are discussed in the paper and collected in the package KronPACK. We present numerical results on experiments up to dimension six from different fields of numerical analysis, which show the effectiveness of the approach.
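
    To make the μ-mode idea concrete, here is a minimal NumPy sketch (not the KronPACK MATLAB/Octave code) of the mode-μ product as one unfolding plus a single GEMM, and of the Tucker operator as a sequence of such products; the function names are illustrative only.

```python
import numpy as np

def mu_mode_product(T, M, mu):
    """Mode-mu product: contract matrix M with axis mu of tensor T.

    Matricize T along axis mu, apply one GEMM, fold the result back --
    the BLAS formulation the abstract refers to.
    """
    T_mat = np.moveaxis(T, mu, 0).reshape(T.shape[mu], -1)  # mode-mu unfolding
    R = M @ T_mat                                           # single GEMM call
    new_shape = (M.shape[0],) + tuple(s for i, s in enumerate(T.shape) if i != mu)
    return np.moveaxis(R.reshape(new_shape), 0, mu)         # fold back

def tucker_operator(T, matrices):
    """Tucker operator: one mu-mode product per dimension of T."""
    for mu, M in enumerate(matrices):
        T = mu_mode_product(T, M, mu)
    return T
```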

    TDC: Towards Extremely Efficient CNNs on GPUs via Hardware-Aware Tucker Decomposition

    Tucker decomposition is one of the SOTA CNN model compression techniques. However, unlike the FLOPs reduction, we observe very limited inference-time reduction with Tucker-compressed models using existing GPU software such as cuDNN. To this end, we propose an efficient end-to-end framework that can generate highly accurate and compact CNN models via Tucker decomposition and optimized inference code on GPUs. Specifically, we propose an ADMM-based training algorithm that can achieve highly accurate Tucker-format models. We also develop a high-performance kernel for Tucker-format convolutions and analytical performance models to guide the selection of execution parameters. We further propose a co-design framework to determine the proper Tucker ranks driven by practical inference time (rather than FLOPs). Our evaluation on five modern CNNs with an A100 GPU demonstrates that our compressed models with our optimized code achieve up to 3.14X speedup over cuDNN, 1.45X speedup over TVM, and 4.57X speedup over the original models using cuDNN, with up to 0.05% accuracy loss.
    Comment: 12 pages, 8 figures, 3 tables, accepted by PPoPP '2
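
    As background for the Tucker-format convolutions mentioned above, the sketch below shows the generic Tucker-2 factorization of a convolutional layer in PyTorch, where one C_out × C_in × k × k kernel is replaced by three smaller convolutions. The ranks r_in and r_out are assumed to come from some rank-selection step (in the paper, the time-driven co-design); this is the generic decomposed structure, not the paper's optimized GPU kernel.

```python
import torch.nn as nn

def tucker2_conv(c_in, c_out, k, r_in, r_out, stride=1, padding=0):
    """Generic Tucker-2 factorization of a k x k convolution (illustrative only)."""
    return nn.Sequential(
        nn.Conv2d(c_in, r_in, kernel_size=1, bias=False),        # input-channel projection
        nn.Conv2d(r_in, r_out, kernel_size=k, stride=stride,
                  padding=padding, bias=False),                  # small core convolution
        nn.Conv2d(r_out, c_out, kernel_size=1, bias=False),      # output-channel reconstruction
    )
```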

    Tensor Contractions with Extended BLAS Kernels on CPU and GPU

    Tensor contractions constitute a key computational ingredient of numerical multi-linear algebra. However, as the order and dimension of tensors grow, the time and space complexities of tensor-based computations grow quickly. In this paper, we propose and evaluate new BLAS-like primitives that are capable of performing a wide range of tensor contractions on CPU and GPU efficiently. We begin by focusing on single-index contractions involving all the possible configurations of second-order and third-order tensors. Then, we discuss extensions to more general cases. Existing approaches for tensor contractions spend large amounts of time restructuring the data, which typically involves explicit copy and transpose operations. In this work, we summarize existing approaches and present library-based approaches that avoid memory movement. Through systematic benchmarking, we demonstrate that our approach can achieve 10x speedup on a K40c GPU and 2x speedup on dual-socket Haswell-EP CPUs, using cuBLAS and MKL respectively, for small and moderate tensor sizes. This is relevant in many machine learning applications, such as deep learning, where tensor sizes tend to be small but numerous tensor contraction operations must be performed successively. Concretely, we implement a Tucker decomposition and show that using our kernels yields at least an order of magnitude speedup compared to state-of-the-art libraries.
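
    As a toy illustration of the copy-avoiding idea, under the assumption that a single-index contraction can be cast as a batched GEMM over strided views (the paper's kernels call StridedBatched BLAS routines directly), consider:

```python
import numpy as np

# Single-index contraction C[m, n, p] = sum_k A[m, k, p] * B[k, n]
rng = np.random.default_rng(0)
m, k, n, p = 64, 32, 48, 16
A = rng.standard_normal((m, k, p))
B = rng.standard_normal((k, n))

C_ref = np.einsum('mkp,kn->mnp', A, B)   # reference (may copy internally)

# Transpose-free formulation: matmul broadcasts over the leading batch axis,
# so only strided views of A are taken -- no explicit copy/transpose pass.
C = np.matmul(np.moveaxis(A, 2, 0), B)   # shape (p, m, n)
C = np.moveaxis(C, 0, 2)                 # back to (m, n, p)

assert np.allclose(C, C_ref)
```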

    Analyzing the Performance Portability of Tensor Decomposition

    We employ pressure point analysis and roofline modeling to identify performance bottlenecks and determine an upper bound on the performance of the Canonical Polyadic Alternating Poisson Regression Multiplicative Update (CP-APR MU) algorithm in the SparTen software library. Our analyses reveal that a particular matrix computation, Φ^(n), is the critical performance bottleneck in the SparTen CP-APR MU implementation. Moreover, we find that atomic operations are not a critical bottleneck, while higher cache reuse can provide a non-trivial performance improvement. We also utilize grid search on the Kokkos library parallel policy parameters to achieve a 2.25x average speedup over the SparTen default for the Φ^(n) computation on CPU and 1.70x on GPU. We conclude our investigations by comparing Kokkos implementations of the STREAM benchmark and the matricized tensor times Khatri-Rao product (MTTKRP) benchmark from the Parallel Sparse Tensor Algorithm (PASTA) benchmark suite to implementations using vendor libraries. We show that with a single implementation Kokkos achieves performance comparable to hand-tuned code for fundamental operations that make up tensor decomposition kernels on a wide range of CPU and GPU systems. Overall, we conclude that Kokkos demonstrates good performance portability for simple data-intensive operations but requires tuning for algorithms with more complex dependencies and data access patterns.
    Comment: 28 pages, 19 figures
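
    For reference, the MTTKRP benchmark operation mentioned above can be written for a dense tensor in a few lines of NumPy; this is a sketch for intuition only, since SparTen and PASTA target sparse tensors with specialized kernels.

```python
import numpy as np

def mttkrp(T, factors, n):
    """Dense MTTKRP for mode n: mode-n unfolding of T times the Khatri-Rao
    product of all factor matrices except the n-th (illustrative sketch)."""
    other = [i for i in range(T.ndim) if i != n]
    kr = factors[other[0]]
    for i in other[1:]:                      # column-wise Kronecker (Khatri-Rao)
        kr = (kr[:, None, :] * factors[i][None, :, :]).reshape(-1, kr.shape[1])
    T_n = np.moveaxis(T, n, 0).reshape(T.shape[n], -1)   # mode-n unfolding
    return T_n @ kr                          # result: I_n x R
```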

    Exponential integrators: tensor structured problems and applications

    The solution of stiff systems of Ordinary Differential Equations (ODEs), which typically arise after spatial discretization of many important evolutionary Partial Differential Equations (PDEs), constitutes a topic of wide interest in numerical analysis. A prominent way to numerically integrate such systems involves using exponential integrators. In general, these kinds of schemes do not require the solution of (non)linear systems but rather the action of the matrix exponential and of some specific exponential-like functions (known in the literature as phi-functions). In this PhD thesis we aim at presenting efficient tensor-based tools to approximate such actions, both from a theoretical and from a practical point of view, when the problem has an underlying Kronecker sum structure. Moreover, we investigate the application of exponential integrators to compute numerical solutions of important equations in various fields, such as plasma physics, mean-field optimal control, and computational chemistry. In all cases, we provide several numerical examples and perform extensive simulations, exploiting modern hardware architectures such as multi-core Central Processing Units (CPUs) and Graphics Processing Units (GPUs). The results globally show the effectiveness and superiority of the proposed approaches.
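
    A one-line instance of the Kronecker sum structure referred to above: for M = A ⊕ B = A ⊗ I + I ⊗ B, one has exp(M) = exp(A) ⊗ exp(B), so the action of the matrix exponential reduces to small mode-wise products and the large matrix is never formed. A minimal SciPy check of this identity (illustrative, not the thesis code):

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(1)
nA, nB = 5, 4
A = rng.standard_normal((nA, nA))
B = rng.standard_normal((nB, nB))
U = rng.standard_normal((nB, nA))   # vec(U) stacks columns (Fortran order)

# Tensor-structured action: two small dense products instead of one huge expm.
small = expm(B) @ U @ expm(A).T

# Brute-force check on the assembled Kronecker sum A ⊕ B.
M = np.kron(A, np.eye(nB)) + np.kron(np.eye(nA), B)
big = expm(M) @ U.ravel(order='F')

assert np.allclose(small.ravel(order='F'), big)
```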

    Remote Sensing Data Compression

    A huge amount of data is acquired nowadays by different remote sensing systems installed on satellites, aircraft, and UAVs. The acquired data then have to be transferred to image processing centres, stored, and/or delivered to customers. In restricted scenarios, data compression is strongly desired or necessary. A wide diversity of coding methods can be used, depending on the requirements and their priority. In addition, the types and properties of images differ a lot; thus, practical implementation aspects have to be taken into account. The Special Issue paper collection on which this book is based touches on all of the aforementioned items to some degree, giving the reader an opportunity to learn about recent developments and research directions in the field of image compression. In particular, lossless and near-lossless compression of multi- and hyperspectral images remains topical, since such images constitute data arrays of extremely large size with rich information that can be retrieved from them for various applications. Another important aspect is the impact of lossless compression on image classification and segmentation, where a reasonable compromise between the characteristics of compression and the final tasks of data processing has to be achieved. The problems of data transmission from UAV-based acquisition platforms, as well as the use of FPGAs and neural networks, have become very important. Finally, attempts to apply compressive sensing approaches in remote sensing image processing with positive outcomes are observed. We hope that readers will find our book useful and interesting.

    Efficient Matricization of n-D Array with CUDA and Its Evaluation

    Scientific and engineering computing requires operating on huge amounts of data with a very high number of dimensions. The traditional multidimensional array is widely popular for implementing higher-dimensional data, but its performance diminishes as the number of dimensions increases. On the other hand, the traditional row-column view is easy to implement, reason about, and visualize. This paper details a representation scheme for higher-dimensional arrays with a row-column abstraction in a parallel environment. Odd dimensions contribute along the row direction and even dimensions along the column direction, which gives a lower cost of index computation, higher data locality, and parallelism. Each 2-D block of size blockIdx.x × threadIdx.x is independent of the others. Theoretically, the scheme has no limitation on the number of dimensions, and the mapping algorithm is the same for any number of dimensions. Performance of the proposed matricization is measured with matrix-matrix addition, subtraction, and multiplication operations. Experimental results show promising performance improvement over the Traditional Multidimensional Array (TMA) and Extended Karnaugh Map Representation (EKMR) schemes. Thus, the scheme can be used for implementing higher-dimensional arrays in both general-purpose and scientific computing on GPUs. © 2016 IEEE
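
    One plausible reading of the odd/even mapping described above is sketched below (a hypothetical helper, not the paper's CUDA code): the 1st, 3rd, ... dimensions accumulate into the row index and the 2nd, 4th, ... dimensions into the column index, each with its own running stride.

```python
def rc_index(idx, dims):
    """Map an n-D index to a (row, col) pair by sending odd-numbered
    dimensions (1st, 3rd, ...) to the row direction and even-numbered
    ones (2nd, 4th, ...) to the column direction. Illustrative sketch
    of the scheme as described in the abstract."""
    row = col = 0
    row_stride = col_stride = 1
    for k, (i, n) in enumerate(zip(idx, dims)):
        if k % 2 == 0:            # 1st, 3rd, ... dimension -> rows
            row += i * row_stride
            row_stride *= n
        else:                     # 2nd, 4th, ... dimension -> columns
            col += i * col_stride
            col_stride *= n
    return row, col
```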