Format Abstraction for Sparse Tensor Algebra Compilers
This paper shows how to build a sparse tensor algebra compiler that is
agnostic to tensor formats (data layouts). We develop an interface that
describes formats in terms of their capabilities and properties, and show how
to build a modular code generator where new formats can be added as plugins. We
then describe six implementations of the interface that compose to form the
dense, CSR/CSF, COO, DIA, ELL, and HASH tensor formats and countless variants
thereof. With these implementations at hand, our code generator can generate
code to compute any tensor algebra expression on any combination of the
aforementioned formats.
To demonstrate our technique, we have implemented it in the taco tensor
algebra compiler. Our modular code generator design makes it simple to add
support for new tensor formats, and the performance of the generated code is
competitive with hand-optimized implementations. Furthermore, by extending taco
to support a wider range of formats specialized for different application and
data characteristics, we can improve end-user application performance. For
example, if input data is provided in the COO format, our technique allows
computing a single matrix-vector multiplication directly on the data in COO,
which is up to 3.6× faster than first converting the data to CSR.
Comment: Presented at OOPSLA 201
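The benefit of computing directly on COO data can be seen in a small sketch (illustrative only, not taco's generated code; the function name and argument layout are assumptions):

```python
import numpy as np

def spmv_coo(n_rows, rows, cols, vals, x):
    # y = A @ x with A stored as COO triplets (parallel arrays of row
    # indices, column indices, and values). Each stored nonzero is
    # scatter-added into its output row; no CSR conversion pass is needed.
    y = np.zeros(n_rows)
    np.add.at(y, rows, vals * x[cols])
    return y

# A = [[1, 0, 2],
#      [0, 3, 0]] stored as COO triplets
rows = np.array([0, 0, 1])
cols = np.array([0, 2, 1])
vals = np.array([1.0, 2.0, 3.0])
x = np.array([1.0, 1.0, 1.0])
y = spmv_coo(2, rows, cols, vals, x)  # -> [3. 3.]
```

Skipping the COO-to-CSR conversion matters most when the matrix is used only once, since the conversion itself costs about as much as one multiplication pass.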
The Tensor Networks Anthology: Simulation techniques for many-body quantum lattice systems
We present a compendium of numerical simulation techniques, based on tensor
network methods, aiming to address problems of many-body quantum mechanics on a
classical computer. The core setting of this anthology are lattice problems in
low spatial dimension at finite size, a physical scenario where tensor network
methods, both Density Matrix Renormalization Group and beyond, have long proven
to be winning strategies. Here we explore in detail the numerical frameworks
and methods employed to deal with low-dimension physical setups, from a
computational physics perspective. We focus on symmetries and closed-system
simulations in arbitrary boundary conditions, while discussing the numerical
data structures and linear algebra manipulation routines involved, which form
the core libraries of any tensor network code. At a higher level, we put the
spotlight on loop-free network geometries, discussing their advantages, and
presenting in detail algorithms to simulate low-energy equilibrium states.
Accompanied by discussions of data structures, numerical techniques and
performance, this anthology serves as a programmer's companion, as well as a
self-contained introduction and review of the basic and selected advanced
concepts in tensor networks, including examples of their applications.
Comment: 115 pages, 56 figures
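As a minimal illustration of the data structures such codes are built on (a hypothetical numpy sketch, not code from the anthology), a state vector can be split into a matrix product state by repeated SVD and contracted back exactly:

```python
import numpy as np

def to_mps(psi, n_sites, d=2):
    # Split a state vector into a chain of rank-3 tensors by repeated
    # SVD (left-canonical form), the basic data structure of DMRG-style
    # tensor network codes.
    tensors, rank, rest = [], 1, np.asarray(psi, dtype=float)
    for _ in range(n_sites - 1):
        u, s, vh = np.linalg.svd(rest.reshape(rank * d, -1),
                                 full_matrices=False)
        tensors.append(u.reshape(rank, d, -1))
        rank = u.shape[1]
        rest = np.diag(s) @ vh
    tensors.append(rest.reshape(rank, d, 1))
    return tensors

def contract(tensors):
    # Contract the virtual (bond) indices back into a full state vector.
    out = tensors[0]
    for t in tensors[1:]:
        out = np.tensordot(out, t, axes=1)
    return out.reshape(-1)

psi = np.arange(8.0)
psi /= np.linalg.norm(psi)
mps = to_mps(psi, n_sites=3)  # three rank-3 tensors
```

Without truncation the decomposition is exact; real codes truncate the singular values to bound the bond dimension, which is where the compression of tensor network methods comes from.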
Design and optimization of a portable LQCD Monte Carlo code using OpenACC
The present panorama of HPC architectures is extremely heterogeneous, ranging
from traditional multi-core CPU processors, supporting a wide class of
applications but delivering moderate computing performance, to many-core GPUs,
exploiting aggressive data-parallelism and delivering higher performances for
streaming computing applications. In this scenario, code portability (and
performance portability) becomes necessary for easy maintainability of
applications; this is very relevant in scientific computing, where code changes
are frequent, making it tedious and error-prone to keep different code versions
aligned. In this work we present the design and optimization of a
state-of-the-art production-level LQCD Monte Carlo application, using the
directive-based OpenACC programming model. OpenACC abstracts parallel
programming to a descriptive level, relieving programmers from specifying how
codes should be mapped onto the target architecture. We describe the
implementation of a code fully written in OpenACC, and show that we are able to
target several different architectures, including state-of-the-art traditional
CPUs and GPUs, with the same code. We also measure performance, evaluating the
computing efficiency of our OpenACC code on several architectures, comparing
with GPU-specific implementations and showing that a good level of
performance portability can be reached.
Comment: 26 pages, 2 png figures, preprint of an article submitted for
consideration in International Journal of Modern Physics
Conjugate gradient solvers on Intel Xeon Phi and NVIDIA GPUs
Lattice Quantum Chromodynamics simulations typically spend most of the
runtime in inversions of the Fermion Matrix. This part is therefore frequently
optimized for various HPC architectures. Here we compare the performance of the
Intel Xeon Phi to current Kepler-based NVIDIA Tesla GPUs running a conjugate
gradient solver. By exposing more parallelism to the accelerator through
inverting multiple vectors at the same time, we obtain a performance greater
than 300 GFlop/s on both architectures. This more than doubles the performance
of the inversions. We also give a short overview of the Knights Corner
architecture, discuss some details of the implementation and the effort
required to obtain the achieved performance.
Comment: 7 pages, proceedings, presented at 'GPU Computing in High Energy
Physics', September 10-12, 2014, Pisa, Italy
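The batching idea, solving several right-hand sides in one pass so that matrix-vector products become matrix-matrix products, can be sketched as follows (an illustrative numpy toy, not the paper's Xeon Phi/GPU code; the function name is an assumption):

```python
import numpy as np

def block_cg(A, B, tol=1e-10, max_iter=100):
    # Solve A X = B for symmetric positive definite A, treating all
    # right-hand-side columns of B at once. Batching the solves turns
    # matrix-vector products into matrix-matrix products, which is what
    # exposes extra parallelism to an accelerator.
    X = np.zeros_like(B)
    R = B - A @ X
    P = R.copy()
    rs = np.sum(R * R, axis=0)          # one residual norm per column
    for _ in range(max_iter):
        AP = A @ P
        alpha = rs / np.sum(P * AP, axis=0)   # per-column step sizes
        X = X + P * alpha
        R = R - AP * alpha
        rs_new = np.sum(R * R, axis=0)
        if np.all(np.sqrt(rs_new) < tol):
            break
        P = R + P * (rs_new / rs)
        rs = rs_new
    return X

A = np.array([[4.0, 1.0], [1.0, 3.0]])
B = np.array([[1.0, 2.0], [2.0, 1.0]])  # two right-hand sides at once
X = block_cg(A, B)
```

In production lattice codes the operator application is the fused Fermion-matrix kernel rather than a dense matrix product, but the batching structure is the same.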
Optimal network topologies: Expanders, Cages, Ramanujan graphs, Entangled networks and all that
We report on some recent developments in the search for optimal network
topologies. First we review some basic concepts on spectral graph theory,
including adjacency and Laplacian matrices, and paying special attention to the
topological implications of having large spectral gaps. We also introduce
related concepts such as expanders, Ramanujan graphs, and Cage graphs.
Afterwards, we discuss two different dynamical features of networks,
synchronizability and the flow of random walkers, and show that both are
optimized when the corresponding Laplacian matrix has a large spectral gap.
From this, by developing a numerical optimization algorithm, we show that
maximum synchronizability and fast random walk spreading are obtained for a
particular type of extremely homogeneous regular network, with long loops and
poor modular structure, that we call entangled networks. These turn out to be
related to Ramanujan and Cage graphs.
We also argue that these graphs are very good finite-size approximations to
Bethe lattices, and provide optimal or almost optimal solutions to many other
problems, for instance searchability in the presence of congestion or the
performance of neural networks. We then study how these results are modified
for dynamical processes controlled by a normalized (weighted and directed)
dynamics; much more heterogeneous graphs are optimal in this case. Finally, a
critical discussion of the limitations and possible extensions
of this work is presented.
Comment: 17 pages, 11 figures. Small corrections and a new reference. Accepted
for pub. in JSTA
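The link between topology and spectral gap can be checked numerically in a few lines (an illustrative sketch; the graph sizes and helper names are assumptions):

```python
import numpy as np

def laplacian_gap(adj):
    # Spectral gap: the smallest nonzero eigenvalue of the graph
    # Laplacian L = D - A (valid for a connected graph, where the
    # smallest eigenvalue is 0).
    lap = np.diag(adj.sum(axis=1)) - adj
    return np.linalg.eigvalsh(lap)[1]

def ring(n):
    # A ring: regular but poorly expanding, hence a small gap.
    a = np.zeros((n, n))
    for i in range(n):
        a[i, (i + 1) % n] = a[(i + 1) % n, i] = 1.0
    return a

complete = np.ones((8, 8)) - np.eye(8)  # best possible expansion
gap_ring = laplacian_gap(ring(8))       # 2 - 2*cos(2*pi/8), about 0.59
gap_complete = laplacian_gap(complete)  # exactly 8
```

The better-connected topology has the larger gap, which is exactly the property that speeds up synchronization and random-walk mixing; the entangled networks of the abstract push this gap toward its theoretical (Ramanujan) limit for a fixed degree.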
An Algebraic Framework for Compositional Program Analysis
The purpose of a program analysis is to compute an abstract meaning for a
program which approximates its dynamic behaviour. A compositional program
analysis accomplishes this task with a divide-and-conquer strategy: the meaning
of a program is computed by dividing it into sub-programs, computing their
meaning, and then combining the results. Compositional program analyses are
desirable because they can yield scalable (and easily parallelizable) program
analyses.
This paper presents an algebraic framework for designing, implementing, and
proving the correctness of compositional program analyses. A program analysis
in our framework is defined by an algebraic structure equipped with sequencing,
choice, and iteration operations. From the analysis design perspective, a
particularly interesting consequence of this is that the meaning of a loop is
computed by applying the iteration operator to the loop body. This style of
compositional loop analysis can yield interesting ways of computing loop
invariants that cannot be defined iteratively. We identify a class of
algorithms, the so-called path-expression algorithms [Tarjan1981,Scholz2007],
which can be used to efficiently implement analyses in our framework. Lastly,
we develop a theory for proving the correctness of an analysis by establishing
an approximation relationship between an algebra defining a concrete semantics
and an algebra defining an analysis.
Comment: 15 pages
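A toy instance of such an algebra (illustrative only; the parity domain and helper names are not from the paper) equips a parity analysis with sequencing, choice, and iteration operators:

```python
PARITIES = ('even', 'odd')
# An analysis element is a transformer: a map from each input parity of
# a variable x to the set of parities x may have afterwards.

def seq(f, g):
    # Sequencing: run f, then g, on every parity f can produce.
    return {p: frozenset(q for r in f[p] for q in g[r]) for p in PARITIES}

def choice(f, g):
    # Choice: either branch may execute, so join (union) the outcomes.
    return {p: f[p] | g[p] for p in PARITIES}

def iterate(f):
    # Iteration: least fixpoint of "zero or more repetitions" of f,
    # computed by Kleene iteration (the parity lattice is finite, so
    # this terminates).
    star = {p: frozenset([p]) for p in PARITIES}  # identity: 0 repetitions
    while True:
        nxt = choice(star, seq(star, f))
        if nxt == star:
            return star
        star = nxt

inc = {'even': frozenset(['odd']), 'odd': frozenset(['even'])}   # x += 1
dbl = {'even': frozenset(['even']), 'odd': frozenset(['even'])}  # x *= 2
# while (*) { x += 1; x *= 2 }: the loop's meaning is obtained by
# applying the iteration operator to the body, not by iterating an
# analysis to convergence over the CFG.
loop = iterate(seq(inc, dbl))
```

Here `loop['odd']` is `{'even', 'odd'}`: zero iterations keep the initial odd parity, while any positive number of iterations yields an even value, which is the kind of loop summary the framework computes compositionally.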
MILC Code Performance on High End CPU and GPU Supercomputer Clusters
With recent developments in parallel supercomputing architecture, many-core,
multi-core, and GPU processors are now commonplace, resulting in more levels of
parallelism, memory hierarchy, and programming complexity. It has been
necessary to adapt the MILC code to these new processors starting with NVIDIA
GPUs, and more recently, the Intel Xeon Phi processors. We report on our
efforts to port and optimize our code for the Intel Knights Landing
architecture. We consider performance of the MILC code with MPI and OpenMP, and
optimizations with QOPQDP and QPhiX. For the latter approach, we concentrate on
the staggered conjugate gradient and gauge force. We also consider performance
on recent NVIDIA GPUs using the QUDA library.
Lecture Notes of Tensor Network Contractions
Tensor network (TN), a young mathematical tool of high vitality and great
potential, has been undergoing extremely rapid development in the last two
decades, gaining tremendous success in condensed matter physics, atomic
physics, quantum information science, statistical physics, and so on. In these
lecture notes, we focus on the contraction algorithms of TN as well as some of
the applications to the simulations of quantum many-body systems. Starting from
basic concepts and definitions, we first explain the relations between TN and
physical problems, including the TN representations of classical partition
functions, quantum many-body states (by matrix product state, tree TN, and
projected entangled pair state), time evolution simulations, etc. These
problems, which are challenging to solve, can be transformed into TN
contraction problems. We then present several paradigm algorithms based on the
ideas of the
numerical renormalization group and/or boundary states, including density
matrix renormalization group, time-evolving block decimation,
coarse-graining/corner tensor renormalization group, and several distinguished
variational algorithms. Finally, we revisit the TN approaches from the
perspective of multi-linear algebra (also known as tensor algebra or tensor
decompositions) and quantum simulation. Despite the apparent differences in the
ideas and strategies of different TN algorithms, we aim at revealing the
underlying relations and resemblances in order to present a systematic picture
to understand the TN contraction approaches.
Comment: 134 pages, 68 figures. In this version, the manuscript has been
changed into the format of a book; new sections about tensor networks and
quantum circuits have been added.
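A minimal example of a TN contraction (an illustrative sketch, not taken from the notes): the partition function of the 1D classical Ising model on a ring of N spins is the trace of the N-th power of a 2x2 transfer matrix, a tiny tensor network that can be contracted exactly:

```python
import numpy as np

beta, N = 0.5, 10  # inverse temperature and ring length (assumed values)

# Transfer matrix: T[s, s'] = exp(beta * s * s') for spins s, s' = +/-1.
T = np.array([[np.exp(beta), np.exp(-beta)],
              [np.exp(-beta), np.exp(beta)]])

# Contracting the ring of N identical tensors = trace of T^N.
Z = np.trace(np.linalg.matrix_power(T, N))

# The same contraction via the eigenvalues 2*cosh(beta), 2*sinh(beta).
lam = np.linalg.eigvalsh(T)
Z_eig = np.sum(lam ** N)
```

Exactly summing over all 2^N spin configurations is replaced by N small matrix products; higher-dimensional networks are not exactly contractible, which is why the approximate renormalization-group contraction schemes in the notes are needed.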