56,809 research outputs found
Topological Optimization of the Evaluation of Finite Element Matrices
We present a topological framework for finding low-flop algorithms for
evaluating element stiffness matrices associated with multilinear forms for
finite element methods posed over straight-sided affine domains. This framework
relies on phrasing the computation on each element as the contraction of a
collection of reference element tensors with an element-specific geometric
tensor. We then present a new concept of complexity-reducing relations that
serve as distance relations between these reference element tensors. This
notion sets up a graph-theoretic context in which we may find an optimized
algorithm by computing a minimum spanning tree. We present experimental results
for some common multilinear forms showing significant reductions in operation
count and also discuss some efficient algorithms for building the graph we use
for the optimization.
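As a toy illustration of the graph-theoretic step, the sketch below flattens a few hypothetical reference element tensors into tuples, replaces the paper's complexity-reducing relations with a simple entry-wise Hamming distance (an invented stand-in, not the paper's actual relations), and runs Kruskal's algorithm to find the minimum spanning tree that decides which tensor's contraction is derived from which:

```python
# Toy sketch of the paper's optimization: reference element tensors become
# graph vertices, a "distance relation" gives edge weights, and a minimum
# spanning tree picks an evaluation order that reuses earlier results.
from itertools import combinations

def hamming(t, u):
    """Toy distance relation: entries that differ between two tensors.
    Deriving one contraction from the other costs about this many extra ops."""
    return sum(a != b for a, b in zip(t, u))

def mst_edges(tensors):
    """Kruskal's algorithm over the complete graph of tensors."""
    parent = list(range(len(tensors)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    edges = sorted(
        (hamming(tensors[i], tensors[j]), i, j)
        for i, j in combinations(range(len(tensors)), 2)
    )
    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j, w))
    return tree

# Four flattened 2x2 "reference element tensors" (illustrative values only).
tensors = [(1, 0, 0, 1), (1, 0, 0, 2), (1, 1, 0, 2), (0, 1, 1, 0)]
tree = mst_edges(tensors)
total = sum(w for _, _, w in tree)
print(tree, total)
```

Traversing the tree from any root then evaluates each new contraction at roughly the cost of the connecting edge weight rather than from scratch, which is the source of the reported reductions in operation count.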
NVIDIA Tensor Core Programmability, Performance & Precision
The NVIDIA Volta GPU microarchitecture introduces a specialized unit called
the "Tensor Core," which performs one matrix multiply-and-accumulate operation on 4x4 matrices
per clock cycle. The NVIDIA Tesla V100 accelerator, featuring the Volta
microarchitecture, provides 640 Tensor Cores with a theoretical peak
performance of 125 Tflops/s in mixed precision. In this paper, we investigate
current approaches to programming NVIDIA Tensor Cores, their performance, and
the precision loss due to computation in mixed precision.
Currently, NVIDIA provides three different ways of programming
matrix-multiply-and-accumulate on Tensor Cores: the CUDA Warp Matrix Multiply
Accumulate (WMMA) API; CUTLASS, a templated library based on WMMA; and cuBLAS
GEMM. After experimenting with these approaches, we found that NVIDIA
Tensor Cores can deliver up to 83 Tflops/s in mixed precision on a Tesla V100
GPU, seven and three times the performance of single- and half-precision
arithmetic, respectively. A WMMA implementation of batched GEMM reaches a performance of 4
Tflops/s. While precision loss due to matrix multiplication with half precision
input might be critical in many HPC applications, it can be considerably
reduced at the cost of increased computation. Our results indicate that HPC
applications using matrix multiplications can strongly benefit from using
NVIDIA Tensor Cores.
Comment: This paper has been accepted by the Eighth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES) 201
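The abstract's final point, that half-precision input loss can be traded against extra computation, can be sketched in pure Python: the `struct` module's `'e'` format rounds a double to IEEE 754 half precision, and a split dot product (fp16 high part plus fp16 residual) recovers most of the lost accuracy at roughly triple the multiplies. This is a software model of the idea only; the Tensor Core datapath itself is not simulated, and the input values are arbitrary:

```python
# Model of fp16-input precision loss and a residual-split correction.
# Accumulation is done in Python doubles, standing in for the fp32
# accumulator of mixed-precision Tensor Core GEMM.
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack('e', struct.pack('e', x))[0]

def dot_fp16(a, b):
    """Dot product with fp16 inputs and wide accumulation."""
    return sum(to_fp16(x) * to_fp16(y) for x, y in zip(a, b))

def dot_fp16_split(a, b):
    """Compensated dot product: each operand is split into an fp16 high
    part plus an fp16 residual, trading extra multiplies for accuracy."""
    ah = [to_fp16(x) for x in a]
    al = [to_fp16(x - h) for x, h in zip(a, ah)]
    bh = [to_fp16(y) for y in b]
    bl = [to_fp16(y - h) for y, h in zip(b, bh)]
    return (sum(x * y for x, y in zip(ah, bh))
            + sum(x * y for x, y in zip(ah, bl))
            + sum(x * y for x, y in zip(al, bh)))

a = [1.0001, 2.0003, 3.0007]
b = [0.5001, 0.2503, 0.1251]
exact = sum(x * y for x, y in zip(a, b))
print(abs(dot_fp16(a, b) - exact), abs(dot_fp16_split(a, b) - exact))
```

The split version drops only the tiny residual-times-residual term, so its error is orders of magnitude below the naive fp16 dot product.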
Deflated GMRES for Systems with Multiple Shifts and Multiple Right-Hand Sides
We consider solution of multiply shifted systems of nonsymmetric linear
equations, possibly also with multiple right-hand sides. First, for a single
right-hand side, the matrix is shifted by several multiples of the identity.
Such problems arise in a number of applications, including lattice quantum
chromodynamics where the matrices are complex and non-Hermitian. Some Krylov
iterative methods such as GMRES and BiCGStab have been used to solve multiply
shifted systems for about the cost of solving just one system. Restarted GMRES
can be improved by deflating eigenvalues for matrices that have a few small
eigenvalues. We show that a particular deflated method, GMRES-DR, can be
applied to multiply shifted systems. In quantum chromodynamics, it is common to
have multiple right-hand sides with multiple shifts for each right-hand side.
We develop a method that efficiently solves the multiple right-hand sides by
using a deflated version of GMRES and yet keeps costs for all of the multiply
shifted systems close to those for one shift. An example is given showing this
can be extremely effective with a quantum chromodynamics matrix.
Comment: 19 pages, 9 figures
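The property that makes shifted solves nearly free is that Krylov subspaces are shift-invariant: K_m(A, b) = K_m(A + sigma*I, b), so a single Arnoldi/GMRES basis serves every shift. A minimal numeric check of this fact follows, with an arbitrary small non-symmetric matrix and shift, and a bare-bones Gaussian-elimination rank helper (all values are illustrative):

```python
# Shift invariance of Krylov subspaces: stacking the bases for A and
# A + sigma*I does not increase the rank, so they span the same space.

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def krylov_basis(A, b, m):
    """Vectors b, A b, ..., A^(m-1) b (unnormalized; span comparison only)."""
    vs, v = [], b[:]
    for _ in range(m):
        vs.append(v)
        v = matvec(A, v)
    return vs

def rank(rows, tol=1e-9):
    """Row rank via Gaussian elimination with partial pivoting."""
    rows = [r[:] for r in rows]
    r = 0
    for c in range(len(rows[0])):
        piv = max(range(r, len(rows)), key=lambda i: abs(rows[i][c]), default=None)
        if piv is None or abs(rows[piv][c]) < tol:
            continue
        rows[r], rows[piv] = rows[piv], rows[r]
        for i in range(r + 1, len(rows)):
            f = rows[i][c] / rows[r][c]
            rows[i] = [x - f * y for x, y in zip(rows[i], rows[r])]
        r += 1
    return r

A = [[4.0, 1.0, 0.0, 2.0],
     [0.0, 3.0, 1.0, 0.0],
     [1.0, 0.0, 2.0, 1.0],
     [0.0, 2.0, 0.0, 5.0]]
sigma = 1.7
A_shift = [[A[i][j] + (sigma if i == j else 0.0) for j in range(4)] for i in range(4)]
b = [1.0, 0.0, 2.0, -1.0]

K = krylov_basis(A, b, 3)
K_shift = krylov_basis(A_shift, b, 3)
# Same span: the stacked bases still have rank 3.
print(rank(K), rank(K_shift), rank(K + K_shift))
```

This is why the methods in the abstract can solve all shifted systems for roughly the cost of one: the expensive basis construction is shared, and only the small projected problems differ per shift.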
GraphBLAST: A High-Performance Linear Algebra-based Graph Framework on the GPU
High-performance implementations of graph algorithms are challenging to
develop on new parallel hardware such as GPUs for three reasons: (1) the
difficulty of designing suitable graph building blocks, (2) load imbalance on
parallel hardware, and (3) the low arithmetic intensity of graph problems.
To address some of these challenges, GraphBLAS is an innovative, ongoing
effort by the graph analytics community to propose building blocks based on
sparse linear algebra, which allow graph algorithms to be expressed in a
performant, succinct, composable, and portable manner. In this paper, we examine
the performance challenges of a linear-algebra-based approach to building graph
frameworks and describe new design principles for overcoming these bottlenecks.
Among these design principles is exploiting input sparsity, which allows
users to write graph algorithms without specifying push or pull direction.
Exploiting output sparsity allows users to tell the backend which values of
the output of a single vectorized computation they do not want computed.
Load balancing distributes work evenly amongst parallel workers; we describe
the load-balancing features needed to handle graphs with different
characteristics. The design principles described in this paper
have been implemented in "GraphBLAST", the first high-performance linear
algebra-based graph framework on NVIDIA GPUs that is open-source. The results
show that on a single GPU, GraphBLAST has on average at least an order of
magnitude speedup over previous GraphBLAS implementations SuiteSparse and GBTL,
comparable performance to the fastest GPU hardwired primitives and
shared-memory graph frameworks Ligra and Gunrock, and better performance than
any other GPU graph framework, while offering a simpler and more concise
programming model.
Comment: 50 pages, 14 figures, 14 tables
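The linear-algebraic formulation the abstract builds on can be illustrated with BFS, the canonical GraphBLAS example: each level expansion is a sparse matrix-vector product over the boolean (OR, AND) semiring, masked by the complement of the visited set. The pure-Python sketch below mimics that pattern with sets (GraphBLAST runs the same pattern on the GPU; the example graph is made up):

```python
# BFS as repeated masked "SpMV" over the boolean semiring: the sparse
# frontier set plays the input vector (input sparsity), and subtracting
# the visited set masks outputs we do not want computed (output sparsity).

def bfs_levels(adj, src):
    """adj[v] = set of out-neighbors. Returns {vertex: BFS level}."""
    levels = {src: 0}
    frontier = {src}           # sparse input vector
    level = 0
    while frontier:
        level += 1
        # OR over the frontier, AND along edges, masked by unvisited vertices.
        frontier = {w for v in frontier for w in adj[v]} - levels.keys()
        for w in frontier:
            levels[w] = level
    return levels

adj = {0: {1, 2}, 1: {3}, 2: {3}, 3: {4}, 4: set()}
print(bfs_levels(adj, 0))
```

Expressing the traversal this way is what lets a backend choose push or pull direction automatically, as the abstract describes.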
Asymmetric Leakage from Multiplier and Collision-Based Single-Shot Side-Channel Attack
The single-shot collision attack on RSA proposed by Hanley et al. is studied, focusing on the difference between the two operands of a multiplier. We show how leakage from an integer multiplier and from a long-integer multiplication algorithm can be asymmetric between the two operands. The asymmetric leakage is verified with experiments on FPGA and micro-controller platforms. Moreover, we show an experimental result in which the success or failure of the attack is determined by the order of the operands; therefore, choosing the operand order can serve as a cost-effective countermeasure. Meanwhile, we also show a case in which a particular countermeasure becomes ineffective when the asymmetric leakage is taken into account. In addition to this main contribution, an extension of the attack by Hanley et al. using the signal-processing technique of the Big Mac attack is presented.
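A toy model makes the operand asymmetry concrete: suppose a w-bit word multiplier leaks a weighted Hamming-weight combination of its two input ports (the weighting 2*HW(x) + HW(y) below is invented for illustration, not measured hardware). Schoolbook long multiplication then produces a different leakage trace depending on which long operand feeds which port, which is the effect the operand-order countermeasure exploits:

```python
# Invented leakage model: port x couples twice as strongly as port y.
# Swapping which operand drives which port changes the whole trace.

def hw(x):
    """Hamming weight of an integer."""
    return bin(x).count('1')

def words(n, w=8):
    """Split integer n into base-2^w digits, least significant first."""
    out = []
    while n:
        out.append(n & ((1 << w) - 1))
        n >>= w
    return out or [0]

def leakage_trace(a, b, w=8):
    """Modeled leakage of every word-by-word multiply in schoolbook order:
    port x receives a's words, port y receives b's words."""
    return [2 * hw(x) + hw(y) for x in words(a, w) for y in words(b, w)]

a, b = 0xF00F, 0x0302
print(leakage_trace(a, b))
print(leakage_trace(b, a))
```

Even though a*b and b*a compute the same product, the two traces differ, so a collision attack that matches traces can succeed for one operand order and fail for the other.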
Parametrizing Complex Hadamard Matrices
The purpose of this paper is to introduce new parametric families of complex
Hadamard matrices in two different ways. First, we prove that every real
Hadamard matrix of order N>=4 admits an affine orbit. This settles a recent
open problem of Tadej and Zyczkowski, who asked whether a real Hadamard matrix
can be isolated among complex ones. In particular, we apply our construction to
the only (up to equivalence) real Hadamard matrix of order 12 and show that the
arising affine family is different from all previously known examples. Second,
we recall a well-known construction related to real conference matrices, and
show how to introduce an affine parameter in the arising complex Hadamard
matrices. This leads to new parametric families of orders 10 and 14. An
interesting feature of both of our constructions is that the arising families
cannot be obtained via Dita's general method. Our results extend the recent
catalogue of complex Hadamard matrices, and may lead to direct applications in
quantum-information theory.
Comment: 16 pages; Final version. Submitted to: European Journal of Combinatorics
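To make "affine family through a real Hadamard matrix" concrete at the smallest interesting order, the sketch below checks the standard one-parameter family F4(a) from the Tadej-Zyczkowski catalogue: every member is a complex Hadamard matrix, and at a = pi/2 the matrix becomes real. (The paper's new families are of orders 10, 12, and 14; this order-4 example is only illustrative of the phenomenon.)

```python
# F4(a): entries 1, -1, and +/- i*e^{ia}; complex Hadamard for every a,
# real Hadamard at a = pi/2 where i*e^{ia} = -1.
import cmath, math

def f4(a):
    z = 1j * cmath.exp(1j * a)
    return [[1,  1,  1,  1],
            [1,  z, -1, -z],
            [1, -1,  1, -1],
            [1, -z, -1,  z]]

def is_hadamard(H, tol=1e-12):
    """Unimodular entries and H H^dagger = N I."""
    n = len(H)
    if any(abs(abs(x) - 1) > tol for row in H for x in row):
        return False
    for i in range(n):
        for j in range(n):
            g = sum(H[i][k] * H[j][k].conjugate() for k in range(n))
            if abs(g - (n if i == j else 0)) > tol:
                return False
    return True

print(is_hadamard(f4(0.0)), is_hadamard(f4(0.83)))
H_real = f4(math.pi / 2)   # z = i * e^{i pi/2} = -1, so all entries are real
print(all(abs(x.imag) < 1e-9 for row in H_real for x in row))
```

At a = 0 the family passes through the Fourier matrix F_4, so this single affine parameter already connects a real Hadamard matrix to genuinely complex ones, the behavior the paper establishes for all real Hadamard matrices of order N >= 4.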