
    Cross-scale efficient tensor contractions for coupled cluster computations through multiple programming model backends

    Coupled-cluster methods provide highly accurate models of molecular structure through explicit numerical calculation of tensors representing the correlation between electrons. These calculations are dominated by a sequence of tensor contractions, motivating the development of numerical libraries for such operations. While based on matrix–matrix multiplication, these libraries are specialized to exploit symmetries in the molecular structure and in electronic interactions, and thus reduce the size of the tensor representation and the complexity of contractions. The resulting algorithms are irregular, and their parallelization has previously been achieved via dynamic scheduling or specialized data decompositions. We introduce our efforts to extend the Libtensor framework to work in a distributed-memory environment in a scalable and energy-efficient manner. We achieve up to 240× speedup compared with the optimized shared-memory implementation of Libtensor. We attain scalability to hundreds of thousands of compute cores on three distributed-memory architectures (Cray XC30 and XC40, and IBM Blue Gene/Q) and on a heterogeneous GPU-CPU system (Cray XK7). As the bottleneck shifts from compute-bound DGEMMs to communication-bound collectives with increasing molecular system size, we adopt two radically different parallelization approaches to handling load imbalance: tasking and bulk-synchronous models. Nevertheless, we preserve a unified interface to both programming models to maintain the productivity of computational quantum chemists.
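    The contractions this abstract describes reduce to matrix–matrix multiplies after flattening index groups. A minimal sketch (with illustrative dimensions and random data, not the Libtensor API) of a CCSD-like contraction expressed both as an einsum and as a single DGEMM:

```python
import numpy as np

# Hypothetical dimensions: o = occupied orbitals, v = virtual orbitals.
o, v = 4, 6
rng = np.random.default_rng(0)

# T2-like amplitudes t[c,d,i,j] and two-electron-integral-like tensor w[a,b,c,d].
t = rng.standard_normal((v, v, o, o))
w = rng.standard_normal((v, v, v, v))

# A typical contraction: r[a,b,i,j] = sum_{c,d} w[a,b,c,d] * t[c,d,i,j].
r_einsum = np.einsum('abcd,cdij->abij', w, t)

# The same contraction cast as one matrix-matrix multiply (DGEMM):
# flatten (a,b) into rows, (i,j) into columns, and (c,d) into the summed index.
r_gemm = (w.reshape(v * v, v * v) @ t.reshape(v * v, o * o)).reshape(v, v, o, o)

assert np.allclose(r_einsum, r_gemm)
```

Real coupled-cluster libraries additionally exploit permutational and spin symmetries to store and contract only unique tensor blocks, which is what makes the resulting algorithms irregular.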

    Automated cache optimisations of stencil computations for partial differential equations

    This thesis focuses on numerical methods that solve partial differential equations. Our focal point is the finite difference method, which solves partial differential equations by approximating derivatives with explicit finite differences. These partial differential equation solvers consist of stencil computations on structured grids. Stencils for real-world practical applications are patterns often characterised by many memory accesses and non-trivial arithmetic expressions, leading to high computational costs compared with the simple stencils used in much prior proof-of-concept work. In addition, the loop nests that express stencils on structured grids may be complicated. This work is motivated by a specific domain of stencil computations in which one of the challenges is operations not aligned to the structured grid ("off-the-grid"). These operations update neighbouring grid points through scatter and gather operations via non-affine memory accesses, such as A[B[i]]. In addition to this challenge, these practical stencils often include many computed fields (requiring multiple grid copies to be stored), complex data dependencies and imperfect loop nests. In this work, we aim to increase the performance of stencil kernel execution. We study automated cache-memory-dependent optimisations for stencil computations. This work consists of two core parts with their respective contributions. The first part of our work tries to reduce the data movement in stencil computations of practical interest. Data movement is a dominant factor affecting the performance of high-performance computing applications. It has long been a target of optimisations due to its impact on execution time and energy consumption. This thesis tries to relieve this cost by applying temporal blocking optimisations, also known as time-tiling, to stencil computations. Temporal blocking is a well-known technique to enhance data reuse in stencil computations.
    However, it is rarely used in practical applications, appearing mostly in theoretical examples that prove its efficacy; applying temporal blocking to scientific simulations is more complex. More specifically, in this work we focus on the application context of seismic and medical imaging. In this area, we often encounter scatter and gather operations due to signal sources and receivers at arbitrary locations in the computational domain. These operations make the application of temporal blocking challenging. We present an approach to overcome this challenge and successfully apply temporal blocking. In the second part of our work, we extend the first part into an automated approach targeting a wide range of simulations modelled with partial differential equations. Since temporal blocking is error-prone, tedious to apply by hand and highly complex to assimilate theoretically and practically, we are motivated to automate its application and automatically generate code that benefits from it. We discuss algorithmic approaches and present a generalised compiler pipeline to automate the application of temporal blocking. These passes are written in the Devito compiler. They are used to accelerate the computation of stencil kernels in areas such as seismic and medical imaging, computational fluid dynamics and machine learning. Devito (www.devitoproject.org) is a Python package for implementing optimised stencil computations (e.g., finite differences, image processing, machine learning) from high-level symbolic problem definitions. Devito builds on SymPy (www.sympy.org) and employs automated code generation and just-in-time compilation to execute optimised computational kernels on several computer platforms, including CPUs, GPUs, and clusters thereof. We show how we automate temporal blocking code generation without user intervention and often achieve better time-to-solution.
    We enable domain-specific optimisation through compiler passes and offer temporal blocking gains from a high-level symbolic abstraction. These automated optimisations benefit various computational kernels for solving real-world application problems.
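    The "off-the-grid" operations described above can be sketched in a few lines of plain Python (a toy 1-D grid with illustrative source and receiver positions, not the thesis's implementation): sources scatter values into the grid and receivers gather values back through an index array, i.e. non-affine accesses of the form A[B[i]].

```python
# A tiny 1-D grid (the array A) and arbitrary source positions (the array B).
grid = [0.0] * 10
src_idx = [2, 7]             # nearest grid points to off-the-grid sources
src_val = [1.5, -0.5]

# Scatter (source injection): A[B[i]] += value -- an access pattern a compiler
# cannot analyse as an affine function of the loop index i.
for i, b in enumerate(src_idx):
    grid[b] += src_val[i]

# Gather (receiver sampling): read A[B[i]] at arbitrary receiver locations.
rec_idx = [2, 3, 7]
samples = [grid[b] for b in rec_idx]
print(samples)  # [1.5, 0.0, -0.5]
```

It is exactly these index-array dependences that break the compile-time dependence analysis temporal blocking relies on, which is why practical imaging stencils resisted time-tiling before this work.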

    Quantum-Inspired Machine Learning: a Survey

    Quantum-inspired Machine Learning (QiML) is a burgeoning field, receiving global attention from researchers for its potential to leverage principles of quantum mechanics within classical computational frameworks. However, current review literature often presents a superficial exploration of QiML, focusing instead on the broader Quantum Machine Learning (QML) field. In response to this gap, this survey provides an integrated and comprehensive examination of QiML, exploring QiML's diverse research domains, including tensor network simulations, dequantized algorithms, and others, showcasing recent advancements and practical applications, and illuminating potential future research avenues. Further, a concrete definition of QiML is established by analyzing various prior interpretations of the term and their inherent ambiguities. As QiML continues to evolve, we anticipate a wealth of future developments drawing from quantum mechanics, quantum computing, and classical machine learning, enriching the field further. This survey serves as a guide for researchers and practitioners alike, providing a holistic understanding of QiML's current landscape and future directions.
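    One QiML building block the survey names, tensor network simulation, can be illustrated with a minimal matrix product state (MPS). This is a generic sketch with illustrative site count and bond dimension, not code from the survey: a state of n qubits is stored as n small rank-3 cores and contracted back into the full 2**n-dimensional vector.

```python
import numpy as np

n, chi = 4, 3                       # number of sites, maximum bond dimension
rng = np.random.default_rng(1)

# One rank-3 core per site with shape (left bond, physical index, right bond);
# the boundary bonds have dimension 1.
dims = [1] + [chi] * (n - 1) + [1]
cores = [rng.standard_normal((dims[k], 2, dims[k + 1])) for k in range(n)]

# Contract the cores left to right into the dense state vector.
state = cores[0]                    # shape (1, 2, chi)
for core in cores[1:]:
    state = np.tensordot(state, core, axes=([-1], [0]))
state = state.reshape(-1)           # 2**n amplitudes

assert state.shape == (2 ** n,)
```

The classical appeal is the storage trade-off: the MPS holds O(n * chi**2) numbers instead of 2**n amplitudes, which is what lets tensor networks emulate certain quantum computations classically.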

    Code Generation for High Performance PDE Solvers on Modern Architectures

    Numerical simulation with partial differential equations is an important discipline in high performance computing. Notable application areas include geosciences, fluid dynamics, solid mechanics and electromagnetics. Recent hardware developments have made it increasingly hard to achieve very good performance. This is due both to a lack of numerical algorithms suited to the hardware and to a lack of efficient implementations of those algorithms. Modern CPUs require a sufficiently high arithmetic intensity in order to unfold their full potential. In this thesis, we use a numerical scheme that is well-suited for this scenario: the Discontinuous Galerkin Finite Element Method on cuboid meshes can be implemented with optimal complexity, exploiting the tensor product structure of basis functions and quadrature formulae using a technique called sum factorization. A matrix-free implementation of this scheme significantly lowers the memory footprint of the method and delivers a fully compute-bound algorithm. An efficient implementation of this scheme for a modern CPU requires maximum use of the processor’s SIMD units. General purpose compilers are not capable of autovectorizing traditional PDE simulation codes, requiring high performance implementations to explicitly spell out SIMD instructions. With SIMD widths increasing in recent years (reaching a current peak of 512 bits in the Intel Skylake architecture) and programming languages not providing tools to directly target SIMD units, such code suffers from a performance-portability issue. This work proposes generative programming as a solution to this issue. To this end, we develop a toolchain that translates a PDE problem expressed in a domain specific language into a piece of machine-dependent, optimized C++ code. This toolchain is embedded into the existing user workflow of the DUNE project, an open source framework for the numerical solution of PDEs.
    Compared to other such toolchains, special emphasis is put on an intermediate representation that enables performance-oriented transformations. Furthermore, this thesis defines a new class of SIMD vectorization strategies that operate on batches of subkernels within one integration kernel. The space of these vectorization strategies is explored systematically from within the code generator in an autotuning procedure. We demonstrate the performance of our vectorization strategies and their implementation by providing measurements on the Intel Haswell and Intel Skylake architectures. We present numbers for the diffusion-reaction equation, the Stokes equations and Maxwell’s equations, achieving up to 40% of the machine’s theoretical floating-point performance for an application of the DG operator.
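    Sum factorization, the key complexity argument above, can be sketched independently of DUNE. Assuming a tensor-product operator built from one illustrative 1-D matrix A (e.g. a differentiation or interpolation matrix), applying it one dimension at a time avoids ever forming the full p³ × p³ element matrix:

```python
import numpy as np

p = 5                                   # 1-D basis size (illustrative)
rng = np.random.default_rng(2)
A = rng.standard_normal((p, p))         # 1-D operator
u = rng.standard_normal((p, p, p))      # coefficients on one cuboid element

# Naive application: build the full p^3 x p^3 matrix and multiply, O(p^6) work.
K = np.kron(np.kron(A, A), A)
v_naive = (K @ u.reshape(-1)).reshape(p, p, p)

# Sum-factorized application: three small contractions, one per dimension,
# O(p^4) work in total.
v_sf = np.einsum('ia,abc->ibc', A, u)
v_sf = np.einsum('jb,ibc->ijc', A, v_sf)
v_sf = np.einsum('kc,ijc->ijk', A, v_sf)

assert np.allclose(v_naive, v_sf)
```

The matrix-free character follows directly: only the small 1-D matrix A and the coefficient tensor are ever held in memory, which is what raises arithmetic intensity enough for the kernel to become compute-bound.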

    Optimizing AI at the Edge: from network topology design to MCU deployment

    The first topic analyzed in the thesis is Neural Architecture Search (NAS). I will focus on two different tools that I developed: one to optimize the architecture of Temporal Convolutional Networks (TCNs), a recently emerged convolutional model for time-series processing, and one to optimize the data precision of tensors inside CNNs. The first NAS explicitly targets the optimization of the most peculiar architectural parameters of TCNs, namely dilation, receptive field, and the number of features in each layer; note that this is the first NAS to explicitly target these networks. The second NAS instead focuses on finding the most efficient data format for a target CNN, at the granularity of individual layer filters. Applying these two NASes in sequence allows an "application designer" to minimize the structure of the neural network employed, reducing the number of operations or the memory usage of the network. The second topic described is the optimization of neural network deployment on edge devices. Exploiting the scarce resources of edge platforms is critical for efficient NN execution on MCUs. To this end, I will introduce DORY (Deployment Oriented to memoRY), an automatic tool to deploy CNNs on low-cost MCUs. In a series of steps, DORY can automatically manage the different levels of memory inside the MCU, offload the computation workload (i.e., the different layers of a neural network) to dedicated hardware accelerators, and automatically generate ANSI C code that orchestrates off- and on-chip transfers alongside the computation phases. On top of this, I will introduce two optimized computation libraries that DORY can exploit to deploy TCNs and Transformers efficiently at the edge. I conclude the thesis with two applications in bio-signal analysis, i.e., heart rate tracking and sEMG-based gesture recognition.
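    The per-filter data-format choice the second NAS explores rests on linear quantization. A minimal, hypothetical sketch (function names and the symmetric int8 scheme are illustrative, not the thesis's tool):

```python
def quantize_int8(weights):
    """Symmetric linear quantization: map floats to int8 with one scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # guard all-zero input
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    return [x * scale for x in q]

w = [0.5, -1.27, 0.03, 1.0]          # toy filter weights
q, s = quantize_int8(w)
w_hat = dequantize(q, s)

# Round-to-nearest bounds the reconstruction error by half a quantization step.
assert all(abs(a - b) <= s / 2 + 1e-12 for a, b in zip(w, w_hat))
```

A precision-search NAS would evaluate such formats (int8, int4, mixed) per filter and keep the cheapest one whose accumulated error leaves accuracy acceptable, trading memory and operations against accuracy.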

    Variational Quantum Simulations of Lattice Gauge Theories

    Simulations of lattice gauge theories play a fundamental role in first-principles calculations in the context of high-energy physics. This thesis aims to improve state-of-the-art simulation methods for first-principles calculations and to apply those methods to relevant physical models. We address this problem using three different approaches: machine learning, quantum computing, and tensor networks. In the context of machine learning, we have developed a method to estimate thermodynamic observables in lattice field theories. More precisely, we use deep generative models to estimate the absolute value of the free energy. We have demonstrated the applicability of our method by studying a toy model. Our approach produces more precise measurements than the standard Markov chain Monte Carlo method when we cross a phase transition point. In the context of quantum computing, our goal is to improve the current algorithms for quantum simulations. In this thesis, we have addressed two issues on modern quantum computers: quantum noise mitigation and the design of good parametric quantum circuits. We have developed a mitigation routine for read-out bit-flip errors that can drastically improve quantum simulations. We have also developed a dimensional expressiveness analysis that can identify superfluous parameters in parametric quantum circuits. In addition, we show how to implement the expressivity analysis efficiently using quantum hardware. In the context of tensor networks, we have studied a U(1) quantum link model in 2+1 dimensions in a ladder geometry with DMRG. Our goal is to analyze the properties of the ground state of the model at finite chemical potential. We have observed different winding-number sectors when introducing a chemical potential into the system.

    Quantum Chemistry in the Age of Quantum Computing

    Practical challenges in simulating quantum systems on classical computers have been widely recognized in the quantum physics and quantum chemistry communities over the past century. Although many approximation methods have been introduced, the complexity of quantum mechanics remains hard to tame. The advent of quantum computation brings new pathways to navigate this challenging complexity landscape. By manipulating quantum states of matter and taking advantage of their unique features, such as superposition and entanglement, quantum computers promise to efficiently deliver accurate results for many important problems in quantum chemistry, such as the electronic structure of molecules. In the past two decades, significant advances have been made in developing algorithms and physical hardware for quantum computing, heralding a revolution in the simulation of quantum systems. This article is an overview of the algorithms and results that are relevant for quantum chemistry. The intended audience is both quantum chemists who seek to learn more about quantum computing and quantum computing researchers who would like to explore applications in quantum chemistry.

    Numerical scalar curvature deformation and a gluing construction

    In this work, a new numerical technique to prepare Cauchy data for the initial value problem (IVP) formulation of Einstein's field equations (EFE) is presented. Our method is directly inspired by the exterior asymptotic gluing (EAG) result of Corvino (2000). The argument assumes a moment in time symmetry and allows a composite initial data set to be assembled from (a finite subdomain of) a known asymptotically Euclidean initial data set, which is glued (in a controlled manner) over a compact spatial region to an exterior Schwarzschildean representative. We demonstrate how (Corvino, 2000) may be directly adapted to a numerical scheme and, under the assumption of axisymmetry, construct composite, Hamiltonian-constraint-satisfying initial data featuring internal binary black holes (BBHs) glued to exterior Schwarzschild initial data in isotropic form. The generality of the method is shown in a comparison of properties of EAG composite initial data sets featuring internal BBHs as modelled by Brill-Lindquist and Misner data. The geometric-analysis character of gluing methods requires working within suitably weighted function spaces, which, together with a technical impediment preventing (Corvino, 2000) from being fully constructive, is the principal difficulty in devising a numerical technique. Thus the single previous attempt, by Giulini and Holzegel (2005) (recently implemented by Doulis and Rinne (2016)), sought to avoid this by embedding the result within the well-known Lichnerowicz-York conformal framework, which required ad-hoc assumptions on the solution form and a formal perturbative argument to show that EAG may proceed. In (Giulini and Holzegel, 2005) it was further claimed that judicious engineering of EAG can serve to reduce the presence of spurious gravitational radiation; unfortunately, in line with the general conclusion of (Doulis and Rinne, 2016), our numerical investigation does not appear to indicate that this is the case.
    Concretising the sought initial data to be specified with respect to a spatial manifold with underlying topology R×S², our method exploits a variety of pseudo-spectral (PS) techniques. A combination of the eth-formalism and spin-weighted spherical harmonics, together with a novel complex-analytic based numerical approach, is utilised. This is enabled by our Python 3 based numerical toolkit, allowing for unified, just-in-time compiled, distributed calculations with seamless extension to arbitrary precision for problems involving generic geometric partial differential equations (PDE) as specified by tensorial expressions. Additional features include a layer of abstraction that allows for automatic reduction of indicial (i.e., tensorial) expressions together with grid remapping based on chart specification; hence straightforward implementation of IVP formulations of the EFE, such as ADM-York or ADM-York-NOR, is possible. Code-base verification is performed by evolving the polarised Gowdy T³ space-time with the above formulations, utilising high order, explicit time-integrators in the method of lines approach combined with PS techniques. As the initial data we prepare has a precise (Schwarzschild) exterior, it may be of interest to global evolution schemes that incorporate information from spatial infinity. Furthermore, our approach may shed light on how more general gluing techniques could potentially be adapted for numerical work. The code-base we have developed may also be of interest in application to other problems involving geometric PDEs.
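    The pseudo-spectral differentiation underpinning this toolkit can be illustrated in a few lines (a generic 1-D periodic example with an illustrative grid size, not code from the thesis): differentiate by multiplying the Fourier coefficients by i*k.

```python
import numpy as np

N = 32
x = 2 * np.pi * np.arange(N) / N        # uniform periodic grid on [0, 2*pi)
u = np.sin(x)

k = np.fft.fftfreq(N, d=1.0 / N)        # integer wavenumbers 0..N/2-1, -N/2..-1
du = np.real(np.fft.ifft(1j * k * np.fft.fft(u)))

# For band-limited data the PS derivative is exact to machine precision,
# which is the "spectral accuracy" that motivates PS methods for smooth PDEs.
assert np.allclose(du, np.cos(x))
```

The same idea, with spin-weighted spherical harmonics replacing the Fourier basis on the S² factor, is what makes high-accuracy evaluation of the constraint equations feasible in the gluing construction.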