Finite Element Integration on GPUs

Author: Andy R. Terrel
Datta K.
Markall G.
Maruyama N.
Matthew G. Knepley
Murthy G.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 28/02/2011

We present a novel finite element integration method for low-order elements on GPUs. We achieve more than 100 GF for element integration on first-order discretizations of both the Laplacian and Elasticity operators. Comment: 16 pages, 3 figures

arXiv.org e-Print Archive

Master of Science

Author: Rivera Axel Y.
Publication venue: University of Utah
Publication date: 01/12/2014

Tensors are mathematical representations of physical entities that have magnitude with multiple directions. Tensor contraction is a form of creating these objects using the Einstein summation equation. It is commonly used in physics and chemistry for solving problems like spectral elements and coupled cluster computation. Mathematically, tensor contraction operations can be reduced to expressions similar to matrix multiplications. However, linear algebra libraries (e.g., BLAS and LAPACK) perform poorly on the small matrix sizes that commonly arise in certain tensor contraction computations. Another challenge seen in the computation of tensor contraction is the difference between the mathematical representation and an efficient implementation. This thesis proposes a framework that allows users to express a tensor contraction problem in a high-level mathematical representation and transform it into a linear algebra expression that is mapped to a high-performance implementation. The framework produces code that takes advantage of the parallelism that graphics processing units (GPUs) provide. It relies on autotuning to find the preferred implementation that achieves high performance on the available device. Performance results from the benchmarks tested, nekbone and NWChem, show that the output of the framework achieves a speedup of 8.56x and 14.25x, respectively, on an NVIDIA Tesla C2050 GPU against the sequential version; while using an NVIDIA Tesla K20c GPU it achieved speedups of 8.87x and 17.62x. The parallel decompositions found by the tool were also tested with an OpenACC implementation and achieved a speedup of 8.87x and 10.42x for nekbone, while NWChem obtained a speedup of 7.25x and 10.34x compared to the choices made by default in the OpenACC compiler.
The contributions of this work are: (1) a simplified interface that allows the user to express tensor contraction using a high-level representation and transform it into high-performance code; (2) a decision algorithm that explores a set of optimization strategies for achieving performance; and, (3) a demonstration that this approach can achieve better performance than OpenACC and can be used to accelerate OpenACC

Tensor Contractions with Extended BLAS Kernels on CPU and GPU

Author: Anandkumar Animashree
Cecka Cris
Niranjan U. N.
Shi Yang
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/12/2016

Tensor contractions constitute a key computational ingredient of numerical multi-linear algebra. However, as the order and dimension of tensors grow, the time and space complexities of tensor-based computations grow quickly. In this paper, we propose and evaluate new BLAS-like primitives that are capable of performing a wide range of tensor contractions on CPU and GPU efficiently. We begin by focusing on single-index contractions involving all the possible configurations of second-order and third-order tensors. Then, we discuss extensions to more general cases. Existing approaches for tensor contractions spend large amounts of time restructuring the data, which typically involves explicit copy and transpose operations. In this work, we summarize existing approaches and present library-based approaches that avoid memory movement. Through systematic benchmarking, we demonstrate that our approach can achieve 10x speedup on a K40c GPU and 2x speedup on dual-socket Haswell-EP CPUs, using CUBLAS and MKL respectively, for small and moderate tensor sizes. This is relevant in many machine learning applications such as deep learning, where tensor sizes tend to be small, but numerous tensor contraction operations must be performed successively. Concretely, we implement a Tucker decomposition and show that using our kernels yields at least an order of magnitude speedup as compared to state-of-the-art libraries

Batched Second-Order Adjoint Sensitivity for Reduced Space Methods

Author: Anitescu Mihai
Churavy Valentin
Maldonado Daniel Adrian
Montoison Alexis
Pacaud François
Samaroo Julian
Schanen Michel
Publication venue: 'Society for Industrial & Applied Mathematics (SIAM)'
Publication date: 01/01/2022

This paper presents an efficient method for extracting the second-order sensitivities from a system of implicit nonlinear equations on upcoming computer systems dominated by graphics processing units (GPUs). We design a custom automatic differentiation (AutoDiff) backend that targets highly parallel architectures by extracting the second-order information in batch. When the nonlinear equations are associated with a reduced space optimization problem, we leverage the parallel reverse-mode accumulation in a batched adjoint-adjoint algorithm to efficiently compute the reduced Hessian of the problem. We apply the method to extract the reduced Hessian associated with the balance equations of a power network, and show on the largest instances that a parallel GPU implementation is 30 times faster than a sequential CPU reference based on UMFPACK. Comment: SIAM-PP2

arXiv.org e-Print Archive