
    Format Abstraction for Sparse Tensor Algebra Compilers

    This paper shows how to build a sparse tensor algebra compiler that is agnostic to tensor formats (data layouts). We develop an interface that describes formats in terms of their capabilities and properties, and show how to build a modular code generator where new formats can be added as plugins. We then describe six implementations of the interface that compose to form the dense, CSR/CSF, COO, DIA, ELL, and HASH tensor formats and countless variants thereof. With these implementations at hand, our code generator can generate code to compute any tensor algebra expression on any combination of the aforementioned formats. To demonstrate our technique, we have implemented it in the taco tensor algebra compiler. Our modular code generator design makes it simple to add support for new tensor formats, and the performance of the generated code is competitive with hand-optimized implementations. Furthermore, by extending taco to support a wider range of formats specialized for different application and data characteristics, we can improve end-user application performance. For example, if input data is provided in the COO format, our technique allows computing a single matrix-vector multiplication directly with the data in COO, which is up to 3.6× faster than first converting the data to CSR. Comment: Presented at OOPSLA 2018.
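
    The COO example above is easy to make concrete. The following NumPy sketch (written for this listing; it is neither taco's API nor its generated code) computes a matrix-vector product directly from COO triplets, with no CSR conversion:

        import numpy as np

        def spmv_coo(n_rows, rows, cols, vals, x):
            """Compute y = A @ x with A stored as COO triplets (rows[k], cols[k], vals[k])."""
            y = np.zeros(n_rows)
            # Scatter-accumulate each nonzero contribution vals[k] * x[cols[k]] into y[rows[k]].
            np.add.at(y, rows, vals * x[cols])
            return y

        # Tiny usage example: a 3x3 matrix with four nonzeros.
        rows = np.array([0, 1, 2, 2])
        cols = np.array([0, 2, 1, 2])
        vals = np.array([1.0, 2.0, 3.0, 4.0])
        x = np.array([1.0, 1.0, 1.0])
        print(spmv_coo(3, rows, cols, vals, x))  # -> [1. 2. 7.]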

    The Tensor Networks Anthology: Simulation techniques for many-body quantum lattice systems

    We present a compendium of numerical simulation techniques, based on tensor network methods, aiming to address problems of many-body quantum mechanics on a classical computer. The core setting of this anthology is lattice problems in low spatial dimension at finite size, a physical scenario where tensor network methods, both Density Matrix Renormalization Group and beyond, have long proven to be winning strategies. Here we explore in detail the numerical frameworks and methods employed to deal with low-dimensional physical setups, from a computational physics perspective. We focus on symmetries and closed-system simulations with arbitrary boundary conditions, while discussing the numerical data structures and linear algebra manipulation routines involved, which form the core libraries of any tensor network code. At a higher level, we put the spotlight on loop-free network geometries, discussing their advantages, and presenting in detail algorithms to simulate low-energy equilibrium states. Accompanied by discussions of data structures, numerical techniques and performance, this anthology serves as a programmer's companion, as well as a self-contained introduction and review of the basic and selected advanced concepts in tensor networks, including examples of their applications. Comment: 115 pages, 56 figures.
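
    As a minimal illustration of the kind of data structure and contraction routine such a code is built from, the sketch below (a toy NumPy example written for this listing, not code from the anthology) contracts a random open-boundary matrix product state with itself to obtain its squared norm:

        import numpy as np

        def random_mps(n_sites, phys_dim=2, bond_dim=4):
            """A random open-boundary matrix product state: tensors of shape (Dl, d, Dr)."""
            dims = [1] + [bond_dim] * (n_sites - 1) + [1]
            return [np.random.rand(dims[i], phys_dim, dims[i + 1]) for i in range(n_sites)]

        def mps_norm_squared(mps):
            """Contract <psi|psi> site by site, carrying only a small environment matrix."""
            env = np.ones((1, 1))
            for A in mps:
                # env[a, b] * A[a, s, c] * conj(A)[b, s, d]  ->  new env[c, d]
                env = np.einsum('ab,asc,bsd->cd', env, A, A.conj())
            return float(env[0, 0].real)

        print(mps_norm_squared(random_mps(6)))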

    Design and optimization of a portable LQCD Monte Carlo code using OpenACC

    The present panorama of HPC architectures is extremely heterogeneous, ranging from traditional multi-core CPU processors, supporting a wide class of applications but delivering moderate computing performance, to many-core GPUs, exploiting aggressive data-parallelism and delivering higher performance for streaming computing applications. In this scenario, code portability (and performance portability) becomes necessary for easy maintainability of applications; this is very relevant in scientific computing, where code changes are very frequent and keeping different code versions aligned is tedious and error-prone. In this work we present the design and optimization of a state-of-the-art production-level LQCD Monte Carlo application, using the directive-based OpenACC programming model. OpenACC abstracts parallel programming to a descriptive level, relieving programmers from specifying how codes should be mapped onto the target architecture. We describe the implementation of a code fully written in OpenACC, and show that we are able to target several different architectures, including state-of-the-art traditional CPUs and GPUs, with the same code. We also measure performance, evaluating the computing efficiency of our OpenACC code on several architectures, comparing with GPU-specific implementations and showing that a good level of performance portability can be reached. Comment: 26 pages, 2 png figures, preprint of an article submitted for consideration in International Journal of Modern Physics.

    Conjugate gradient solvers on Intel Xeon Phi and NVIDIA GPUs

    Lattice Quantum Chromodynamics simulations typically spend most of the runtime in inversions of the Fermion Matrix. This part is therefore frequently optimized for various HPC architectures. Here we compare the performance of the Intel Xeon Phi to current Kepler-based NVIDIA Tesla GPUs running a conjugate gradient solver. By exposing more parallelism to the accelerator through inverting multiple vectors at the same time, we obtain a performance greater than 300 GFlop/s on both architectures. This more than doubles the performance of the inversions. We also give a short overview of the Knights Corner architecture, discuss some details of the implementation, and describe the effort required to obtain the achieved performance. Comment: 7 pages, proceedings, presented at 'GPU Computing in High Energy Physics', September 10-12, 2014, Pisa, Italy.
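
    The key optimization described here, inverting several vectors at once, can be sketched in plain NumPy. This is not the production Xeon Phi or GPU solver; it simply runs an independent conjugate gradient recursion for every right-hand side in lockstep, so the dominant operation becomes a matrix-matrix rather than a matrix-vector product:

        import numpy as np

        def block_cg(apply_A, B, tol=1e-10, max_iter=1000):
            """Solve A X = B for all columns of B at once with conjugate gradient.

            Each column follows its own CG recursion; running them together turns
            the dominant matrix-vector products into matrix-matrix products.
            """
            X = np.zeros_like(B)
            R = B - apply_A(X)
            P = R.copy()
            rs_old = np.sum(R * R, axis=0)
            for _ in range(max_iter):
                AP = apply_A(P)
                alpha = rs_old / np.sum(P * AP, axis=0)
                X += alpha * P
                R -= alpha * AP
                rs_new = np.sum(R * R, axis=0)
                if np.all(np.sqrt(rs_new) < tol):
                    break
                P = R + (rs_new / rs_old) * P
                rs_old = rs_new
            return X

        # Usage: a small symmetric positive-definite matrix and 4 right-hand sides.
        n, k = 64, 4
        M = np.random.rand(n, n)
        A = M @ M.T + n * np.eye(n)
        B = np.random.rand(n, k)
        X = block_cg(lambda V: A @ V, B)
        print(np.linalg.norm(A @ X - B))  # should be close to zero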

    Optimal network topologies: Expanders, Cages, Ramanujan graphs, Entangled networks and all that

    We report on some recent developments in the search for optimal network topologies. First we review some basic concepts of spectral graph theory, including adjacency and Laplacian matrices, paying special attention to the topological implications of having large spectral gaps. We also introduce related concepts such as expanders, Ramanujan graphs, and cage graphs. Afterwards, we discuss two different dynamical features of networks, synchronizability and the flow of random walkers, and show that both are optimized when the corresponding Laplacian matrix has a large spectral gap. Building on this, we develop a numerical optimization algorithm and show that maximum synchronizability and fast random-walk spreading are obtained for a particular type of extremely homogeneous regular networks, with long loops and poor modular structure, that we call entangled networks. These turn out to be related to Ramanujan and cage graphs. We also argue that these graphs are very good finite-size approximations to Bethe lattices, and provide optimal or almost optimal solutions to many other problems, for instance searchability in the presence of congestion or the performance of neural networks. We then study how these results are modified for dynamical processes controlled by a normalized (weighted and directed) dynamics; much more heterogeneous graphs are optimal in this case. Finally, a critical discussion of the limitations and possible extensions of this work is presented. Comment: 17 pages, 11 figures. Small corrections and a new reference. Accepted for publication in JSTAT.
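
    The central quantity here, the spectral gap of the graph Laplacian (its algebraic connectivity), is simple to compute for small graphs. The NumPy sketch below uses example graphs chosen for this listing rather than taken from the paper:

        import numpy as np

        def laplacian_spectral_gap(adj):
            """Spectral gap (algebraic connectivity) of an undirected graph.

            For the Laplacian L = D - A the smallest eigenvalue is 0; the gap is
            the second-smallest eigenvalue.
            """
            L = np.diag(adj.sum(axis=1)) - adj
            return np.sort(np.linalg.eigvalsh(L))[1]

        # Example: a 4-cycle versus the complete graph K4 (a larger gap is "better").
        cycle4 = np.array([[0, 1, 0, 1],
                           [1, 0, 1, 0],
                           [0, 1, 0, 1],
                           [1, 0, 1, 0]], dtype=float)
        k4 = np.ones((4, 4)) - np.eye(4)
        print(laplacian_spectral_gap(cycle4))  # 2.0
        print(laplacian_spectral_gap(k4))      # 4.0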

    An Algebraic Framework for Compositional Program Analysis

    The purpose of a program analysis is to compute an abstract meaning for a program which approximates its dynamic behaviour. A compositional program analysis accomplishes this task with a divide-and-conquer strategy: the meaning of a program is computed by dividing it into sub-programs, computing their meaning, and then combining the results. Compositional program analyses are desirable because they can yield scalable (and easily parallelizable) program analyses. This paper presents an algebraic framework for designing, implementing, and proving the correctness of compositional program analyses. A program analysis in our framework is defined by an algebraic structure equipped with sequencing, choice, and iteration operations. From the analysis design perspective, a particularly interesting consequence of this is that the meaning of a loop is computed by applying the iteration operator to the loop body. This style of compositional loop analysis can yield interesting ways of computing loop invariants that cannot be defined iteratively. We identify a class of algorithms, the so-called path-expression algorithms [Tarjan1981, Scholz2007], which can be used to efficiently implement analyses in our framework. Lastly, we develop a theory for proving the correctness of an analysis by establishing an approximation relationship between an algebra defining a concrete semantics and an algebra defining an analysis. Comment: 15 pages.
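
    To give a flavor of such an algebraic structure, here is a deliberately tiny instance invented for this listing (not the framework from the paper): an analysis tracking which variables a fragment may modify, with sequencing, choice, and iteration operators, so that the meaning of a loop is the iteration operator applied to the loop body:

        from dataclasses import dataclass

        @dataclass(frozen=True)
        class ModifiedVars:
            """Toy compositional analysis: the set of variables a fragment may modify."""
            vars: frozenset

            def seq(self, other):      # run self, then other
                return ModifiedVars(self.vars | other.vars)

            def choice(self, other):   # run either self or other
                return ModifiedVars(self.vars | other.vars)

            def iterate(self):         # run self zero or more times
                return ModifiedVars(self.vars)

        # The meaning of a loop is the iteration operator applied to the loop body.
        assign_x = ModifiedVars(frozenset({'x'}))
        assign_y = ModifiedVars(frozenset({'y'}))
        loop = assign_x.seq(assign_y).iterate()   # while (...) { x = ...; y = ... }
        print(sorted(loop.vars))                  # ['x', 'y']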

    MILC Code Performance on High End CPU and GPU Supercomputer Clusters

    With recent developments in parallel supercomputing architecture, many-core, multi-core, and GPU processors are now commonplace, resulting in more levels of parallelism, memory hierarchy, and programming complexity. It has been necessary to adapt the MILC code to these new processors, starting with NVIDIA GPUs and, more recently, the Intel Xeon Phi processors. We report on our efforts to port and optimize our code for the Intel Knights Landing architecture. We consider performance of the MILC code with MPI and OpenMP, and optimizations with QOPQDP and QPhiX. For the latter approach, we concentrate on the staggered conjugate gradient and gauge force. We also consider performance on recent NVIDIA GPUs using the QUDA library.

    Lecture Notes of Tensor Network Contractions

    Tensor network (TN), a young mathematical tool of high vitality and great potential, has been undergoing extremely rapid developments in the last two decades, gaining tremendous success in condensed matter physics, atomic physics, quantum information science, statistical physics, and so on. In these lecture notes, we focus on the contraction algorithms of TNs as well as some of their applications to the simulation of quantum many-body systems. Starting from basic concepts and definitions, we first explain the relations between TNs and physical problems, including the TN representations of classical partition functions, quantum many-body states (by matrix product states, tree TNs, and projected entangled pair states), time evolution simulations, etc. These problems, which are challenging to solve, can be transformed into TN contraction problems. We then present several paradigmatic algorithms based on the ideas of the numerical renormalization group and/or boundary states, including the density matrix renormalization group, time-evolving block decimation, coarse-graining/corner tensor renormalization group, and several distinguished variational algorithms. Finally, we revisit the TN approaches from the perspective of multi-linear algebra (also known as tensor algebra or tensor decompositions) and quantum simulation. Despite the apparent differences in the ideas and strategies of different TN algorithms, we aim at revealing the underlying relations and resemblances in order to present a systematic picture for understanding the TN contraction approaches. Comment: 134 pages, 68 figures. In this version, the manuscript has been changed into the format of a book; new sections about tensor networks and quantum circuits have been added.
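
    One of the simplest examples of the "physical problem as TN contraction" viewpoint is the partition function of a classical 1D Ising ring, which contracts to the trace of a matrix power. The sketch below, written for this listing with arbitrary parameters, checks the contraction against brute-force enumeration:

        import numpy as np
        from itertools import product

        def ising_ring_Z_tensor(n, beta=0.5, J=1.0):
            """Partition function of a 1D Ising ring as a tensor-network contraction.

            Each bond carries a 2x2 Boltzmann-weight tensor; contracting the ring of
            tensors reduces to a matrix power and a trace.
            """
            T = np.array([[np.exp(beta * J), np.exp(-beta * J)],
                          [np.exp(-beta * J), np.exp(beta * J)]])
            return np.trace(np.linalg.matrix_power(T, n))

        def ising_ring_Z_bruteforce(n, beta=0.5, J=1.0):
            """Same partition function by summing over all 2**n spin configurations."""
            Z = 0.0
            for spins in product([1, -1], repeat=n):
                energy = -J * sum(spins[i] * spins[(i + 1) % n] for i in range(n))
                Z += np.exp(-beta * energy)
            return Z

        print(ising_ring_Z_tensor(8), ising_ring_Z_bruteforce(8))  # the two values agree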