729 research outputs found

    Scalable Task-Based Algorithm for Multiplication of Block-Rank-Sparse Matrices

    Full text link
    A task-based formulation of Scalable Universal Matrix Multiplication Algorithm (SUMMA), a popular algorithm for matrix multiplication (MM), is applied to the multiplication of hierarchy-free, rank-structured matrices that appear in the domain of quantum chemistry (QC). The novel features of our formulation are: (1) concurrent scheduling of multiple SUMMA iterations, and (2) fine-grained task-based composition. These features make it tolerant of the load imbalance due to the irregular matrix structure and eliminate all artifactual sources of global synchronization.Scalability of iterative computation of square-root inverse of block-rank-sparse QC matrices is demonstrated; for full-rank (dense) matrices the performance of our SUMMA formulation usually exceeds that of the state-of-the-art dense MM implementations (ScaLAPACK and Cyclops Tensor Framework).Comment: 8 pages, 6 figures, accepted to IA3 2015. arXiv admin note: text overlap with arXiv:1504.0504

    A High-Throughput Solver for Marginalized Graph Kernels on GPU

    Get PDF
    We present the design and optimization of a linear solver on General Purpose GPUs for the efficient and high-throughput evaluation of the marginalized graph kernel between pairs of labeled graphs. The solver implements a preconditioned conjugate gradient (PCG) method to compute the solution to a generalized Laplacian equation associated with the tensor product of two graphs. To cope with the gap between the instruction throughput and the memory bandwidth of current generation GPUs, our solver forms the tensor product linear system on-the-fly without storing it in memory when performing matrix-vector dot product operations in PCG. Such on-the-fly computation is accomplished by using threads in a warp to cooperatively stream the adjacency and edge label matrices of individual graphs by small square matrix blocks called tiles, which are then staged in registers and the shared memory for later reuse. Warps across a thread block can further share tiles via the shared memory to increase data reuse. We exploit the sparsity of the graphs hierarchically by storing only non-empty tiles using a coordinate format and nonzero elements within each tile using bitmaps. Besides, we propose a new partition-based reordering algorithm for aggregating nonzero elements of the graphs into fewer but denser tiles to improve the efficiency of the sparse format.We carry out extensive theoretical analyses on the graph tensor product primitives for tiles of various density and evaluate their performance on synthetic and real-world datasets. Our solver delivers three to four orders of magnitude speedup over existing CPU-based solvers such as GraKeL and GraphKernels. The capability of the solver enables kernel-based learning tasks at unprecedented scales

    High Performance Parallel Algorithms for the Tucker Decomposition of Sparse Tensors

    Get PDF
    International audience—We investigate an efficient parallelization of a class of algorithms for the well-known Tucker decomposition of general N-dimensional sparse tensors. The targeted algorithms are iterative and use the alternating least squares method. At each iteration, for each dimension of an N-dimensional input tensor, the following operations are performed: (i) the tensor is multiplied with (N − 1) matrices (TTMc step); (ii) the product is then converted to a matrix; and (iii) a few leading left singular vectors of the resulting matrix are computed (TRSVD step) to update one of the matrices for the next TTMc step. We propose an efficient parallelization of these algorithms for the current parallel platforms with multicore nodes. We discuss a set of preprocessing steps which takes all computational decisions out of the main iteration of the algorithm and provides an intuitive shared-memory parallelism for the TTM and TRSVD steps. We propose a coarse and a fine-grain parallel algorithm in a distributed memory environment, investigate data dependencies, and identify efficient communication schemes. We demonstrate how the computation of singular vectors in the TRSVD step can be carried out efficiently following the TTMc step. Finally, we develop a hybrid MPI-OpenMP implementation of the overall algorithm and report scalability results on up to 4096 cores on 256 nodes of an IBM BlueGene/Q supercomputer

    Dynasor: A Dynamic Memory Layout for Accelerating Sparse MTTKRP for Tensor Decomposition on Multi-core CPU

    Full text link
    Sparse Matricized Tensor Times Khatri-Rao Product (spMTTKRP) is the most time-consuming compute kernel in sparse tensor decomposition. In this paper, we introduce a novel algorithm to minimize the execution time of spMTTKRP across all modes of an input tensor on multi-core CPU platform. The proposed algorithm leverages the FLYCOO tensor format to exploit data locality in external memory accesses. It effectively utilizes computational resources by enabling lock-free concurrent processing of independent partitions of the input tensor. The proposed partitioning ensures load balancing among CPU threads. Our dynamic tensor remapping technique leads to reduced communication overhead along all the modes. On widely used real-world tensors, our work achieves 2.12x - 9.01x speedup in total execution time across all modes compared with the state-of-the-art CPU implementations

    Pipelining the Fast Multipole Method over a Runtime System

    Get PDF
    Fast Multipole Methods (FMM) are a fundamental operation for the simulation of many physical problems. The high performance design of such methods usually requires to carefully tune the algorithm for both the targeted physics and the hardware. In this paper, we propose a new approach that achieves high performance across architectures. Our method consists of expressing the FMM algorithm as a task flow and employing a state-of-the-art runtime system, StarPU, in order to process the tasks on the different processing units. We carefully design the task flow, the mathematical operators, their Central Processing Unit (CPU) and Graphics Processing Unit (GPU) implementations, as well as scheduling schemes. We compute potentials and forces of 200 million particles in 48.7 seconds on a homogeneous 160 cores SGI Altix UV 100 and of 38 million particles in 13.34 seconds on a heterogeneous 12 cores Intel Nehalem processor enhanced with 3 Nvidia M2090 Fermi GPUs.Comment: No. RR-7981 (2012
    • …
    corecore