Search CORE

21 research outputs found

Master of Science

Author: Rivera Axel Y.
Publication venue: University of Utah
Publication date: 01/12/2014
Field of study

thesisTensors are mathematical representations of physical entities that have magnitude with multiple directions. Tensor contraction is a form of creating these objects using the Einstein summation equation. It is commonly used in physics and chemistry for solving problems like spectral elements and coupled cluster computation. Mathematically, tensor contraction operations can be reduced to expressions similar to matrix multiplications. However, linear algebra libraries (e.g., BLAS and LAPACK) perform poorly on the small matrix sizes that commonly arise in certain tensor contraction computations. Another challenge seen in the computation of tensor contraction is the dierence between the mathematical representation and an ecient implementation. This thesis proposes a framework that allows users to express a tensor contraction problem in a high-level mathematical representation and transform it into a linear algebra expression that is mapped to a high-performance implementation. The framework produces code that takes advantage of the parallelism that graphics processing units (GPUs) provide. It relies on autotuning to nd the preferred implementation that achieves high performance on the available device. Performance results from the benchmarks tested, nekbone and NWChem, show that the output of the framework achieves a speedup of 8.56x and 14.25x, respectively, on an NVIDIA Tesla C2050 GPU against the sequential version; while using an NVIDIA Tesla K20c GPU it achieved speedups of 8.87x and 17.62x. The parallel decompositions found by the tool were also tested with an OpenACC implementation and achieved a speedup of 8.87x and 10.42x for nekbone, while NWChem obtained a speedup of 7.25x and 10.34x compared to the choices made by default in the OpenACC compiler. The contributions of this work are: (1) a simplied interface that allows the user to express tensor contraction using a high-level representation and transform it into high-performance code; (2) a decision algorithm that explores a set of optimization strategies for achieving performance; and, (3) a demonstration that this approach can achieve better performance than OpenACC and can be used to accelerate OpenACC

The University of Utah: J. Willard Marriott Digital Library

Search-based Model-driven Loop Optimizations for Tensor Contractions

Author: Panyala Ajay
Publication venue: LSU Digital Commons
Publication date: 01/01/2014
Field of study

Complex tensor contraction expressions arise in accurate electronic structure models in quantum chemistry, such as the coupled cluster method. The Tensor Contraction Engine (TCE) is a high-level program synthesis system that facilitates the generation of high-performance parallel programs from tensor contraction equations. We are developing a new software infrastructure for the TCE that is designed to allow experimentation with optimization algorithms for modern computing platforms, including for heterogeneous architectures employing general-purpose graphics processing units (GPGPUs). In this dissertation, we present improvements and extensions to the loop fusion optimization algorithm, which can be used with cost models, e.g., for minimizing memory usage or for minimizing data movement costs under a memory constraint. We show that our data structure and pruning improvements to the loop fusion algorithm result in significant performance improvements that enable complex cost models being use for large input equations. We also present an algorithm for optimizing the fused loop structure of handwritten code. It determines the regions in handwritten code that are safe to be optimized and then runs the loop fusion algorithm on the dependency graph of the code. Finally, we develop an optimization framework for generating GPGPU code consisting of loop fusion optimization with a novel cost model, tiling optimization, and layout optimization. Depending on the memory available on the GPGPU and the sizes of the tensors, our framework decides which processor (CPU or GPGPU) should perform an operation and where the result should be moved. We present extensive measurements for tuning the loop fusion algorithm, for validating our optimization framework, and for measuring the performance characteristics of GPGPUs. Our measurements demonstrate that our optimization framework outperforms existing general-purpose optimization approaches both on multi-core CPUs and on GPGPUs

Louisiana State University

Cross-scale efficient tensor contractions for coupled cluster computations through multiple programming model backends

Author: Abrahams
Anna I. Krylov
Augonnet
Epifanovsky
Epifanovsky
Evgeny Epifanovsky
Ghosh
Hirata
Hudak
Irony
Kaiser
Kale
Khaled Z. Ibrahim
Nieplocha
Samuel Williams
Shao
Solomonik
Valentini
Valiev
Van de Geijn
Publication venue: 'Elsevier BV'
Publication date: 01/08/2017
Field of study

Coupled-cluster methods provide highly accurate models of molecular structure through explicit numerical calculation of tensors representing the correlation between electrons. These calculations are dominated by a sequence of tensor contractions, motivating the development of numerical libraries for such operations. While based on matrix–matrix multiplication, these libraries are specialized to exploit symmetries in the molecular structure and in electronic interactions, and thus reduce the size of the tensor representation and the complexity of contractions. The resulting algorithms are irregular and their parallelization has been previously achieved via the use of dynamic scheduling or specialized data decompositions. We introduce our efforts to extend the Libtensor framework to work in the distributed memory environment in a scalable and energy-efficient manner. We achieve up to 240× speedup compared with the optimized shared memory implementation of Libtensor. We attain scalability to hundreds of thousands of compute cores on three distributed-memory architectures (Cray XC30 and XC40, and IBM Blue Gene/Q), and on a heterogeneous GPU-CPU system (Cray XK7). As the bottlenecks shift from being compute-bound DGEMM's to communication-bound collectives as the size of the molecular system scales, we adopt two radically different parallelization approaches for handling load-imbalance, tasking and bulk synchronous models. Nevertheless, we preserve a unified interface to both programming models to maintain the productivity of computational quantum chemists

Crossref

eScholarship - University of California

Distributed-memory multi-GPU block-sparse tensor contraction for electronic structure (revised version)

Author: Bosilca George
Dongarra Jack,
Harrison Robert,
Herault Thomas
Lewis Cannada,
Robert Yves
Valeev Edward,
Publication venue: HAL CCSD
Publication date: 01/10/2020
Field of study

Many domains of scientific simulation (chemistry, condensed matter physics, data science) increasingly eschew dense tensors for block-sparse tensors, sometimes with additional structure (recursive hierarchy, rank sparsity, etc.). Distributed-memory parallel computation with block-sparse tensorial data is paramount to minimize the time-tosolution (e.g., to study dynamical problems or for real-time analysis) and to accommodateproblems of realistic size that are too large to fit into the host/device memory of a single node equipped with accelerators. Unfortunately, computation with such irregular data structures is a poor match to the dominant imperative, bulk-synchronous parallel programming model. In this paper, we focus on the critical element of block-sparse tensor algebra, namely binary tensor contraction, and report on an efficient and scalable implementation using the task-focused PaRSEC runtime. High performance of the block-sparse tensor contraction on the Summit supercomputer is demonstrated for synthetic data as well as for real data involved in electronic structure simulations of unprecedented size.Les tenseurs creux par blocs (block-sparse) sont présents dans de nombreux domaines scientifiques. Ce rapport étudie la parallélisation d’un noyau de contraction essentiel pour la manipulation de tels tenseurs, qui peut se matérialiser sous forme d’un produit de matrices C ← C + AB, où les trois matrices ont une structure creuse par blocs, où les tuiles de A et B sont de tailles hétérogènes, et où B est carrée de taille n, alors que A et C sont rectangulaires de taille m × n avec m << n. Nous proposons une implémentation sur la plate-forme Summit à mémoire distribuée, où chaque nœud est équipé de plusieurs GPUs, au sein de l’environnement de tâches PaRSEC. Nous obtenons de bonnes performances pour des problèmes de taille inégalées à ce jour

INRIA a CCSD electronic archive server

Coupled cluster theory on modern heterogeneous supercomputers

Author: Abdulrahman Y. Zamani
Andreas Erbs Hillers-Bendtsen
Ashleigh Barnes
Dmytro Bykov
Filip Pawłowski
Hector H. Corzo
Jeppe Olsen
Kurt V. Mikkelsen
Poul Jørgensen
Publication venue: Frontiers Media S.A.
Publication date: 01/06/2023
Field of study

This study examines the computational challenges in elucidating intricate chemical systems, particularly through ab-initio methodologies. This work highlights the Divide-Expand-Consolidate (DEC) approach for coupled cluster (CC) theory—a linear-scaling, massively parallel framework—as a viable solution. Detailed scrutiny of the DEC framework reveals its extensive applicability for large chemical systems, yet it also acknowledges inherent limitations. To mitigate these constraints, the cluster perturbation theory is presented as an effective remedy. Attention is then directed towards the CPS (D-3) model, explicitly derived from a CC singles parent and a doubles auxiliary excitation space, for computing excitation energies. The reviewed new algorithms for the CPS (D-3) method efficiently capitalize on multiple nodes and graphical processing units, expediting heavy tensor contractions. As a result, CPS (D-3) emerges as a scalable, rapid, and precise solution for computing molecular properties in large molecular systems, marking it an efficient contender to conventional CC models

Directory of Open Access Journals

Integral-Direct and Parallel Implementation of the CCSD(T) Method: Algorithmic Developments and Large-Scale Applications

Author: Gyevi-Nagy László
Kállay Mihály
Nagy Péter R.
Publication venue: 'American Chemical Society (ACS)'
Publication date: 14/01/2020
Field of study

Repository of the Academy's Library

Recommended from our members

Practical fast matrix multiplication algorithms

Author: Huang Jianyu
Publication venue
Publication date: 18/10/2018
Field of study

Matrix multiplication is a core building block for numerous scientific computing and, more recently, machine learning applications. Strassen's algorithm, the original Fast Matrix Multiplication (FMM) algorithm, has long fascinated computer scientists due to its startling property of reducing the number of computations required for multiplying n x n matrices from O(n³) to O(n [superscript 2.807]). Over the last half century, this has fueled many theoretical improvements such as other variations of Strassen-like FMM algorithms. Previous implementations of these FMM algorithms led to the "street wisdom" that they are only practical for large, relatively square matrices, that they require considerable workspace, and that they are difficult to achieve thread-level parallelism. The thesis of this work dispels these notions by demonstrating significant benefits for small and non-square matrices, requiring no workspace beyond what is already incorporated in high-performance implementations of matrix multiplication, and achieving performance benefits on multi-core, many-core, and distributed memory architectures.Computer Science

Texas ScholarWorks

Associative Instruction Reordering to Alleviate Register Pressure

Author: Pouchet Louis-Noël
Rastello Fabrice
Rountev Atanas
Sadayappan Ponnuswamy
Singh Rawat Prashant
Sukumaran-Rajam Aravind
Publication venue: HAL CCSD
Publication date: 11/11/2018
Field of study

International audienceRegister allocation is generally considered a practically solved problem. For most applications, the register allocation strategies in production compilers are very effective in controlling the number of loads/stores and register spills. However, existing register allocation strategies are not effective and result in excessive register spilling for computation patterns with a high degree of many-to-many data reuse, e.g., high-order stencils and tensor contractions. We develop a source-to-source instruction reordering strategy that exploits the flexibility of reordering associative operations to alleviate register pressure. The developed transformation module implements an adaptable strategy that can appropriately control the degree of instruction-level parallelism, while relieving register pressure. The effectiveness of the approach is demonstrated through experimental results using multiple production compilers (GCC, Clang/LLVM) and target platforms (Intel Xeon Phi, and Intel x86 multi-core)

INRIA a CCSD electronic archive server