Finite Element Integration on GPUs
We present a novel finite element integration method for low order elements
on GPUs. We achieve more than 100GF for element integration on first order
discretizations of both the Laplacian and Elasticity operators.
Comment: 16 pages, 3 figures
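As a point of reference for what "element integration" computes in the first-order Laplacian case, here is a minimal plain-Python sketch of the 3x3 stiffness matrix of one linear (P1) triangle. It is not the GPU method of the paper; the function names and the choice of a 2D triangular element are assumptions made for illustration.

```python
def p1_laplace_element_matrix(verts):
    """3x3 Laplacian stiffness matrix of one linear (P1) triangle.

    verts: three (x, y) vertex coordinates. Hypothetical helper, for illustration.
    """
    (x0, y0), (x1, y1), (x2, y2) = verts
    det = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)  # twice the signed area
    area = abs(det) / 2.0
    # Gradients of the three linear basis functions are constant on the triangle.
    grads = [
        ((y1 - y2) / det, (x2 - x1) / det),
        ((y2 - y0) / det, (x0 - x2) / det),
        ((y0 - y1) / det, (x1 - x0) / det),
    ]
    # K[i][j] = area * grad(phi_i) . grad(phi_j)
    return [[area * (gi[0] * gj[0] + gi[1] * gj[1]) for gj in grads]
            for gi in grads]


def integrate_elements(elements):
    """Every element is independent of the others, so this loop is data-parallel."""
    return [p1_laplace_element_matrix(v) for v in elements]
```

Because each element matrix depends only on its own vertices, the outer loop can in principle be mapped to one thread or thread group per element, which is the kind of parallelism a GPU implementation would exploit.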
Simulating the weak death of the neutron in a femtoscale universe with near-Exascale computing
The fundamental particle theory called Quantum Chromodynamics (QCD) dictates
everything about protons and neutrons, from their intrinsic properties to
interactions that bind them into atomic nuclei. Quantities that cannot be fully
resolved through experiment, such as the neutron lifetime (whose precise value
is important for the existence of light-atomic elements that make the sun shine
and life possible), may be understood through numerical solutions to QCD. We
directly solve QCD using Lattice Gauge Theory and calculate nuclear observables
such as neutron lifetime. We have developed an improved algorithm that
exponentially decreases the time-to-solution and applied it on the new CORAL
supercomputers, Sierra and Summit. We use run-time autotuning to distribute GPU
resources, achieving 20% performance at low node count. We also developed
optimal application mapping through a job manager, which allows CPU and GPU
jobs to be interleaved, yielding 15% of peak performance when deployed across
large fractions of CORAL.
Comment: 2018 Gordon Bell Finalist: 9 pages, 9 figures; v2: fixed 2 typos and
appended acknowledgement
Tensor Contraction Layers for Parsimonious Deep Nets
Tensors offer a natural representation for many kinds of data frequently
encountered in machine learning. Images, for example, are naturally represented
as third order tensors, where the modes correspond to height, width, and
channels. Tensor methods are noted for their ability to discover
multi-dimensional dependencies, and tensor decompositions, in particular, have
been used to produce compact low-rank approximations of data. In this paper, we
explore the use of tensor contractions as neural network layers and investigate
several ways to apply them to activation tensors. Specifically, we propose the
Tensor Contraction Layer (TCL), the first attempt to incorporate tensor
contractions as end-to-end trainable neural network layers. Applied to existing
networks, TCLs reduce the dimensionality of the activation tensors and thus the
number of model parameters. We evaluate the TCL on the task of image
recognition, augmenting two popular networks (AlexNet, VGG). The resulting
models are trainable end-to-end. Applying the TCL to the task of image
recognition, using the CIFAR100 and ImageNet datasets, we evaluate the effect
of parameter reduction via tensor contraction on performance. We demonstrate
significant model compression without significant impact on the accuracy and,
in some cases, improved performance.
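The contraction described above can be sketched in plain Python for a single third-order activation tensor (height x width x channels). This is only an illustration under stated assumptions: the factor matrices `u_h`, `u_w`, `u_c` and the function names are hypothetical, the batch dimension is omitted, and in the paper the factors are learned end-to-end rather than fixed.

```python
def tcl_forward(x, u_h, u_w, u_c):
    """Contract x (H x W x C nested lists) with one factor matrix per mode.

    u_h is (H' x H), u_w is (W' x W), u_c is (C' x C); the result is H' x W' x C'.
    out[a][b][c] = sum_{i,j,k} u_h[a][i] * u_w[b][j] * u_c[c][k] * x[i][j][k]
    """
    H, W, C = len(x), len(x[0]), len(x[0][0])
    return [[[sum(u_h[a][i] * u_w[b][j] * u_c[c][k] * x[i][j][k]
                  for i in range(H) for j in range(W) for k in range(C))
              for c in range(len(u_c))]
             for b in range(len(u_w))]
            for a in range(len(u_h))]


def tcl_parameter_count(u_h, u_w, u_c):
    """The layer's parameters are just the entries of the factor matrices."""
    return sum(len(u) * len(u[0]) for u in (u_h, u_w, u_c))
```

The output tensor is smaller than the input whenever the factor matrices have fewer rows than columns, so any layer that follows sees a lower-dimensional activation.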
Efficient Quantum Circuit Simulation by Tensor Network Methods on Modern GPUs
Efficient simulation of quantum circuits has become indispensable with the
rapid development of quantum hardware. The primary simulation methods are based
on state vectors and tensor networks. As the number of qubits and quantum gates
grows larger in current quantum devices, traditional state-vector based quantum
circuit simulation methods prove inadequate due to the overwhelming size of the
Hilbert space and extensive entanglement. Consequently, brute-force tensor
network simulation algorithms become the only viable solution in such
scenarios. The two main challenges faced in tensor network simulation
algorithms are optimal contraction path finding and efficient execution on
modern computing devices, with the latter determining the actual efficiency. In
this study, we investigate the optimization of such tensor network simulations
on modern GPUs and propose general optimization strategies from two aspects:
computational efficiency and accuracy. Firstly, we propose to transform
critical Einstein summation operations into GEMM operations, leveraging the
specific features of tensor network simulations to amplify the efficiency of
GPUs. Secondly, by analyzing the data characteristics of quantum circuits, we
employ extended precision to ensure the accuracy of simulation results and
mixed precision to fully exploit the potential of GPUs, resulting in faster and
more precise simulations. Our numerical experiments demonstrate that our
approach can achieve a 3.96x reduction in verification time for random quantum
circuit samples in the 18-cycle case of Sycamore, with sustained performance
exceeding 21 TFLOPS on one A100. This method can be easily extended to the
20-cycle case, maintaining the same performance, accelerating by 12.5x compared
to the state-of-the-art CPU-based results and 4.48-6.78x compared to the
state-of-the-art GPU-based results reported in the literature.
Comment: 25 pages, 10 figures
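The general idea of turning an Einstein summation into a GEMM can be sketched as follows: the contracted indices are fused into one long index so that the contraction becomes an ordinary matrix product. This is a minimal plain-Python illustration, not the optimization strategy of the paper; the tensor shapes and function names are assumptions.

```python
def matmul(a, b):
    """Plain matrix product of two nested-list matrices (stands in for GEMM)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def contract_via_gemm(a, b):
    """C[i][l] = sum_{j,k} a[i][j][k] * b[j][k][l], done as one matrix product."""
    # Fuse the contracted indices (j, k) into a single index of length J*K.
    a_mat = [[v for row in a_i for v in row] for a_i in a]  # shape (I, J*K)
    b_mat = [b_jk for b_j in b for b_jk in b_j]             # shape (J*K, L)
    return matmul(a_mat, b_mat)


def contract_direct(a, b):
    """Reference implementation: the summation written out index by index."""
    I, J, K, L = len(a), len(b), len(b[0]), len(b[0][0])
    return [[sum(a[i][j][k] * b[j][k][l] for j in range(J) for k in range(K))
             for l in range(L)]
            for i in range(I)]
```

In this example the contracted indices are already adjacent and in the same order in both operands. In general an index permutation is needed before the reshape, and the cost of that data movement is one of the things a GPU implementation has to manage.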
Tensor Regression Networks
Convolutional neural networks typically consist of many convolutional layers
followed by one or more fully connected layers. While convolutional layers map
between high-order activation tensors, the fully connected layers operate on
flattened activation vectors. Despite empirical success, this approach has
notable drawbacks. Flattening followed by fully connected layers discards
multilinear structure in the activations and requires many parameters. We
address these problems by incorporating tensor algebraic operations that
preserve multilinear structure at every layer. First, we introduce Tensor
Contraction Layers (TCLs) that reduce the dimensionality of their input while
preserving their multilinear structure using tensor contraction. Next, we
introduce Tensor Regression Layers (TRLs), which express outputs through a
low-rank multilinear mapping from a high-order activation tensor to an output
tensor of arbitrary order. We learn the contraction and regression factors
end-to-end, and produce accurate nets with fewer parameters. Additionally, our
layers regularize networks by imposing low-rank constraints on the activations
(TCL) and regression weights (TRL). Experiments on ImageNet show that, applied
to VGG and ResNet architectures, TCLs and TRLs reduce the number of parameters
compared to fully connected layers by more than 65% while maintaining or
increasing accuracy. In addition to the space savings, our approach's ability
to leverage topological structure can be crucial for structured data such as
MRI. In particular, we demonstrate significant performance improvements over
comparable architectures on three tasks associated with the UK Biobank dataset.
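To make the idea of a low-rank regression layer concrete, the sketch below computes a single scalar output as the inner product of an activation tensor with a weight tensor that is stored only through its factors. It assumes a rank-R sum of outer products (CP form) purely for brevity; the abstract only says "low-rank", and the factor layout and names here are hypothetical.

```python
def trl_scalar_output(x, factors_h, factors_w, factors_c, bias=0.0):
    """y = <x, W> + bias with W = sum_r h_r (outer) w_r (outer) c_r.

    x is an H x W x C nested list; factors_h[r], factors_w[r], factors_c[r]
    are the three vectors of the r-th rank-1 term. W itself is never formed.
    """
    total = bias
    for h, w, c in zip(factors_h, factors_w, factors_c):
        total += sum(x[i][j][k] * h[i] * w[j] * c[k]
                     for i in range(len(h))
                     for j in range(len(w))
                     for k in range(len(c)))
    return total


def parameter_counts(H, W, C, rank):
    """Dense weight tensor versus rank-`rank` factored weights (bias excluded)."""
    return H * W * C, rank * (H + W + C)
```

The factored form never flattens the activation, so the mode structure of the input is kept, and its parameter count grows with the sum of the mode sizes rather than their product.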
A Unified Optimization Approach for Sparse Tensor Operations on GPUs
Sparse tensors appear in many large-scale applications with multidimensional
and sparse data. While multidimensional sparse data often need to be processed
on manycore processors, attempts to develop highly-optimized GPU-based
implementations of sparse tensor operations are rare. The irregular computation
patterns and sparsity structures as well as the large memory footprints of
sparse tensor operations make such implementations challenging. We leverage the
fact that sparse tensor operations share similar computation patterns to
propose a unified tensor representation called F-COO. Combined with
GPU-specific optimizations, F-COO provides highly-optimized implementations of
sparse tensor computations on GPUs. The performance of the proposed unified
approach is demonstrated for tensor-based kernels such as the Sparse Matricized
Tensor-Times-Khatri-Rao Product (SpMTTKRP) and the Sparse Tensor-Times-Matrix
Multiply (SpTTM) and is used in tensor decomposition algorithms. Compared to
state-of-the-art work, we improve the performance of SpTTM and SpMTTKRP by up to
3.7 and 30.6 times, respectively, on NVIDIA Titan-X GPUs. We implement a
CANDECOMP/PARAFAC (CP) decomposition and achieve up to 14.9 times speedup using
the unified method over state-of-the-art libraries on NVIDIA Titan-X GPUs.
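For readers unfamiliar with the SpMTTKRP kernel named above, the sketch below computes it for a third-order tensor stored in plain coordinate (COO) format. It does not implement F-COO, whose layout is not described in the abstract, and it is sequential rather than GPU code; the function name and argument layout are assumptions.

```python
def mttkrp_mode0_coo(coords, vals, B, C, n_rows):
    """Mode-0 MTTKRP for a sparse third-order tensor in COO format.

    coords: list of (i, j, k) index triples of the nonzeros
    vals:   matching list of nonzero values
    B, C:   dense factor matrices (nested lists) for modes 1 and 2, same rank
    Returns M with M[i][r] = sum over nonzeros of val * B[j][r] * C[k][r].
    """
    rank = len(B[0])
    out = [[0.0] * rank for _ in range(n_rows)]
    for (i, j, k), v in zip(coords, vals):
        for r in range(rank):
            # Only nonzeros contribute; the Khatri-Rao product is never formed.
            out[i][r] += v * B[j][r] * C[k][r]
    return out
```

Several nonzeros can update the same output row, so a parallel version has to resolve those write conflicts, for example with atomic updates or a segmented reduction. That irregularity is part of what makes sparse tensor kernels hard to optimize on GPUs.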