196 research outputs found
On the Performance Prediction of BLAS-based Tensor Contractions
Tensor operations are surging as the computational building blocks for a
variety of scientific simulations and the development of high-performance
kernels for such operations is known to be a challenging task. While for
operations on one- and two-dimensional tensors there exist standardized
interfaces and highly-optimized libraries (BLAS), for higher dimensional
tensors neither standards nor highly-tuned implementations exist yet. In this
paper, we consider contractions between two tensors of arbitrary dimensionality
and take on the challenge of generating high-performance implementations by
resorting to sequences of BLAS kernels. The approach consists in breaking the
contraction down into operations that only involve matrices or vectors. Since
in general there are many alternative ways of decomposing a contraction, we are
able to methodically derive a large family of algorithms. The main contribution
of this paper is a systematic methodology to accurately identify the fastest
algorithms in the bunch, without executing them. The goal is instead
accomplished with the help of a set of cache-aware micro-benchmarks for the
underlying BLAS kernels. The predictions we construct from such benchmarks
allow us to reliably single out the best-performing algorithms in a tiny
fraction of the time taken by the direct execution of the algorithms.Comment: Submitted to PMBS1
Performance Modeling and Prediction for Dense Linear Algebra
This dissertation introduces measurement-based performance modeling and
prediction techniques for dense linear algebra algorithms. As a core principle,
these techniques avoid executions of such algorithms entirely, and instead
predict their performance through runtime estimates for the underlying compute
kernels. For a variety of operations, these predictions allow to quickly select
the fastest algorithm configurations from available alternatives. We consider
two scenarios that cover a wide range of computations:
To predict the performance of blocked algorithms, we design
algorithm-independent performance models for kernel operations that are
generated automatically once per platform. For various matrix operations,
instantaneous predictions based on such models both accurately identify the
fastest algorithm, and select a near-optimal block size.
For performance predictions of BLAS-based tensor contractions, we propose
cache-aware micro-benchmarks that take advantage of the highly regular
structure inherent to contraction algorithms. At merely a fraction of a
contraction's runtime, predictions based on such micro-benchmarks identify the
fastest combination of tensor traversal and compute kernel
Simulating the weak death of the neutron in a femtoscale universe with near-Exascale computing
The fundamental particle theory called Quantum Chromodynamics (QCD) dictates
everything about protons and neutrons, from their intrinsic properties to
interactions that bind them into atomic nuclei. Quantities that cannot be fully
resolved through experiment, such as the neutron lifetime (whose precise value
is important for the existence of light-atomic elements that make the sun shine
and life possible), may be understood through numerical solutions to QCD. We
directly solve QCD using Lattice Gauge Theory and calculate nuclear observables
such as neutron lifetime. We have developed an improved algorithm that
exponentially decreases the time-to solution and applied it on the new CORAL
supercomputers, Sierra and Summit. We use run-time autotuning to distribute GPU
resources, achieving 20% performance at low node count. We also developed
optimal application mapping through a job manager, which allows CPU and GPU
jobs to be interleaved, yielding 15% of peak performance when deployed across
large fractions of CORAL.Comment: 2018 Gordon Bell Finalist: 9 pages, 9 figures; v2: fixed 2 typos and
appended acknowledgement
Scalable Task-Based Algorithm for Multiplication of Block-Rank-Sparse Matrices
A task-based formulation of Scalable Universal Matrix Multiplication
Algorithm (SUMMA), a popular algorithm for matrix multiplication (MM), is
applied to the multiplication of hierarchy-free, rank-structured matrices that
appear in the domain of quantum chemistry (QC). The novel features of our
formulation are: (1) concurrent scheduling of multiple SUMMA iterations, and
(2) fine-grained task-based composition. These features make it tolerant of the
load imbalance due to the irregular matrix structure and eliminate all
artifactual sources of global synchronization.Scalability of iterative
computation of square-root inverse of block-rank-sparse QC matrices is
demonstrated; for full-rank (dense) matrices the performance of our SUMMA
formulation usually exceeds that of the state-of-the-art dense MM
implementations (ScaLAPACK and Cyclops Tensor Framework).Comment: 8 pages, 6 figures, accepted to IA3 2015. arXiv admin note: text
overlap with arXiv:1504.0504
Tensor Contractions with Extended BLAS Kernels on CPU and GPU
Tensor contractions constitute a key computational ingredient of numerical multi-linear algebra. However, as the order and dimension of tensors grow, the time and space complexities of tensor-based computations grow quickly. In this paper, we propose and evaluate new BLAS-like primitives that are capable of performing a wide range of tensor contractions on CPU and GPU efficiently. We begin by focusing on single-index contractions involving all the possible configurations of second-order and third-order tensors. Then, we discuss extensions to more general cases. Existing approaches for tensor contractions spend large amounts of time restructuring the data which typically involves explicit copy and transpose operations. In this work, we summarize existing approaches and present library-based approaches that avoid memory movement. Through systematic benchmarking, we demonstrate that our approach can achieve 10x speedup on a K40c GPU and 2x speedup on dual-socket Haswell-EP CPUs, using MKL and CUBLAS respectively, for small and moderate tensor sizes. This is relevant in many machine learning applications such as deep learning, where tensor sizes tend to be small, but require numerous tensor contraction operations to be performed successively. Concretely, we implement a Tucker decomposition and show that using our kernels yields atleast an order of magnitude speedup as compared to state-of-the-art libraries
- …