    Automatic transformation of irreducible representations for efficient contraction of tensors with cyclic group symmetry

    Tensor contractions are ubiquitous in computational chemistry and physics, where tensors generally represent states or operators and contractions are transformations. In this context, the states and operators often preserve physical conservation laws, which are manifested as group symmetries in the tensors. These group symmetries imply that each tensor has block sparsity and can be stored in a reduced form. For nontrivial contractions, the memory footprint and cost are lowered, respectively, by a linear and a quadratic factor in the number of symmetry sectors. State-of-the-art tensor contraction software libraries exploit this opportunity by iterating over blocks or using general block-sparse tensor representations. Both approaches entail overhead in performance and code complexity. With intuition aided by tensor diagrams, we present a technique, irreducible representation alignment, which enables efficient handling of Abelian group symmetries using only dense tensors, by employing contraction-specific reduced forms. This technique yields a general algorithm for arbitrary group-symmetric contractions, which we implement in Python and apply to a variety of representative contractions from quantum chemistry and tensor network methods. As a consequence of relying only on dense tensor contractions, we can easily make use of efficient batched matrix multiplication via Intel's MKL and distributed tensor contraction via the Cyclops library, achieving good efficiency and parallel scalability on up to 4096 Knights Landing cores of a supercomputer.
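    To make the reduced-form idea concrete, here is a minimal sketch under simplifying assumptions (q symmetry sectors of uniform block size b, and matrices whose nonzero blocks are diagonal in the sector index; the paper's reduced forms are more general). The point is that the block-sparse product collapses to a single dense batched matrix multiplication, with no block bookkeeping.

```python
# Hypothetical sketch, not the paper's implementation: contracting two
# Z_q-symmetric matrices stored in reduced form via one batched matmul.
import numpy as np

q, b = 4, 8                          # symmetry sectors, uniform block size (toy values)
rng = np.random.default_rng(0)

# Reduced form: only the q nonzero (diagonal) blocks are stored, shape (q, b, b),
# instead of the full (q*b) x (q*b) block-sparse matrix.
A_red = rng.standard_normal((q, b, b))
B_red = rng.standard_normal((q, b, b))

# Charge conservation forbids cross-sector terms, so sector s of C is A_s @ B_s:
# a single batched GEMM over the stacked sectors (dispatched to, e.g., MKL).
C_red = np.matmul(A_red, B_red)

# Check against the explicit block-sparse (block-diagonal) representation.
def densify(T_red):
    T = np.zeros((q * b, q * b))
    for s in range(q):
        T[s * b:(s + 1) * b, s * b:(s + 1) * b] = T_red[s]
    return T

assert np.allclose(densify(C_red), densify(A_red) @ densify(B_red))
```

    The reduced form stores a factor of q fewer elements and multiplies with a factor of roughly q^2 fewer operations than the full dense matrix, matching the linear and quadratic savings quoted above; irreducible representation alignment extends this collapse to contractions whose symmetry structure is not simply block-diagonal.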

    Dataflow Programming Paradigms for Computational Chemistry Methods

    The transition to multicore and heterogeneous architectures has shaped the High Performance Computing (HPC) landscape over the past decades. With the increase in scale, complexity, and heterogeneity of modern HPC platforms, one of the grim challenges for traditional programming models is to sustain the expected performance at scale. By contrast, dataflow programming models have been growing in popularity as a means to deliver a good balance between performance and portability in the post-petascale era. This work introduces dataflow programming models for computational chemistry methods and compares different dataflow executions in terms of programmability, resource utilization, and scalability. This effort is driven by computational chemistry applications, which comprise one of the driving forces of HPC. In particular, many-body methods, such as Coupled Cluster (CC) methods, the gold standard for computing energies in quantum chemistry, are of particular interest to the applied chemistry community. On that account, the latest developments for CC methods are used as the primary vehicle for this research, but our effort is not limited to CC and can be applied across other application domains. Two programming paradigms for expressing CC methods in dataflow form, so that they can utilize task-scheduling systems, are presented: explicit dataflow, in which the developer specifies the dataflow directly, is contrasted with implicit dataflow, in which a task-scheduling runtime derives it. An abstract model is derived to explore the limits of the different dataflow programming paradigms.
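    The contrast between the two paradigms can be illustrated with a toy task graph (hypothetical code, unrelated to the CC implementations discussed here): in the explicit style the developer writes down the dependence graph; in the implicit style a runtime discovers it from how data flows between submitted tasks.

```python
# Toy sketch of explicit vs. implicit dataflow for t3(t1(x), t2(x)).
from concurrent.futures import ThreadPoolExecutor

def t1(x): return x + 1
def t2(x): return x * 2
def t3(a, b): return a + b

# Explicit dataflow: the dependence graph is spelled out by the developer.
graph = {"t1": (t1, ["x"]), "t2": (t2, ["x"]), "t3": (t3, ["t1", "t2"])}

def run_explicit(x):
    done = {"x": x}
    for name, (fn, deps) in graph.items():       # listed in topological order
        done[name] = fn(*(done[d] for d in deps))
    return done["t3"]

# Implicit dataflow: a runtime derives dependencies from the data flow itself;
# here futures and their joins play the role of the derived task graph.
def run_implicit(x):
    with ThreadPoolExecutor() as pool:
        f1 = pool.submit(t1, x)                  # t1 and t2 may run concurrently
        f2 = pool.submit(t2, x)
        return t3(f1.result(), f2.result())      # join expresses the dependence

assert run_explicit(3) == run_implicit(3) == 10
```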

    Distributed-memory multi-GPU block-sparse tensor contraction for electronic structure (revised version)

    Many domains of scientific simulation (chemistry, condensed matter physics, data science) increasingly eschew dense tensors for block-sparse tensors, sometimes with additional structure (recursive hierarchy, rank sparsity, etc.). Distributed-memory parallel computation with block-sparse tensorial data is paramount to minimize the time-to-solution (e.g., to study dynamical problems or for real-time analysis) and to accommodate problems of realistic size that are too large to fit into the host/device memory of a single node equipped with accelerators. Unfortunately, computation with such irregular data structures is a poor match to the dominant imperative, bulk-synchronous parallel programming model. In this paper, we focus on the critical element of block-sparse tensor algebra, namely binary tensor contraction, and report on an efficient and scalable implementation using the task-focused PaRSEC runtime. High performance of the block-sparse tensor contraction on the Summit supercomputer is demonstrated for synthetic data as well as for real data involved in electronic structure simulations of unprecedented size. Block-sparse tensors arise in many scientific domains. This report studies the parallelization of a contraction kernel essential for manipulating such tensors, which can be expressed as a matrix product C ← C + AB, where all three matrices have a block-sparse structure, the tiles of A and B have heterogeneous sizes, and B is square of size n while A and C are rectangular of size m × n with m << n. We propose an implementation on the distributed-memory Summit platform, where each node is equipped with multiple GPUs, within the PaRSEC tasking environment. We obtain good performance for problems of unprecedented size.
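    The core kernel can be sketched serially under invented tile maps (illustrative only; the reported implementation distributes tiles over nodes and GPUs through PaRSEC tasks): only nonzero tiles are stored, tiles may have heterogeneous sizes, and the contraction accumulates a product wherever a tile of A and a tile of B share an inner tile index.

```python
# Minimal serial sketch of block-sparse C <- C + A B with heterogeneous tiles.
# Tile contents and sparsity patterns below are invented for illustration.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(1)
row_sizes = [3, 5]        # heterogeneous tile heights of A and C
col_sizes = [4, 2, 6]     # tile widths of B and C (B square in tile structure)

# Only nonzero tiles are stored, keyed by (tile_row, tile_col).
A = {(0, 0): rng.standard_normal((3, 4)), (1, 2): rng.standard_normal((5, 6))}
B = {(0, 1): rng.standard_normal((4, 2)), (2, 0): rng.standard_normal((6, 4))}
C = defaultdict(lambda: None)

# Sparse block GEMM: a tile product contributes only where A[i,k] and B[k,j]
# both exist; each such product is an independent task in a runtime like PaRSEC.
for (i, k), a in A.items():
    for (k2, j), b in B.items():
        if k == k2:
            prod = a @ b
            C[(i, j)] = prod if C[(i, j)] is None else C[(i, j)] + prod

for (i, j), c in C.items():
    assert c.shape == (row_sizes[i], col_sizes[j])
```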

    Coupled cluster theory on modern heterogeneous supercomputers

    This study examines the computational challenges in elucidating intricate chemical systems, particularly through ab initio methodologies. This work highlights the Divide-Expand-Consolidate (DEC) approach for coupled cluster (CC) theory, a linear-scaling, massively parallel framework, as a viable solution. Detailed scrutiny of the DEC framework reveals its extensive applicability to large chemical systems, yet it also acknowledges inherent limitations. To mitigate these constraints, cluster perturbation theory is presented as an effective remedy. Attention is then directed towards the CPS(D-3) model, explicitly derived from a CC singles parent and a doubles auxiliary excitation space, for computing excitation energies. The reviewed new algorithms for the CPS(D-3) method efficiently capitalize on multiple nodes and graphics processing units, expediting heavy tensor contractions. As a result, CPS(D-3) emerges as a scalable, rapid, and precise solution for computing molecular properties in large molecular systems, marking it as an efficient contender to conventional CC models.
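    The kind of heavy tensor contraction these algorithms accelerate can be sketched generically (toy dimensions, not the CPS(D-3) production code): a CC-style doubles contraction reduces to one large matrix multiplication, exactly the shape of work that multiple nodes and GPUs absorb well.

```python
# Generic sketch of a heavy CC doubles contraction (particle-particle ladder
# term); orbital dimensions are invented toy values.
import numpy as np

nv, no = 12, 6                               # virtual / occupied orbital counts
rng = np.random.default_rng(2)
V = rng.standard_normal((nv, nv, nv, nv))    # two-electron integrals over virtuals
T = rng.standard_normal((nv, nv, no, no))    # doubles amplitudes

# R[a,b,i,j] = sum_{c,d} V[a,b,c,d] * T[c,d,i,j]
R = np.einsum("abcd,cdij->abij", V, T, optimize=True)

# Equivalent explicit GEMM formulation: an (nv^2 x nv^2) @ (nv^2 x no^2) product,
# the form that is handed to vendor BLAS or offloaded to a GPU.
R2 = (V.reshape(nv * nv, nv * nv) @ T.reshape(nv * nv, no * no)).reshape(nv, nv, no, no)
assert np.allclose(R, R2)
```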

    Optimizing work stealing algorithms with scheduling constraints

    The fork-join paradigm of concurrent expression has gained popularity in conjunction with work-stealing schedulers. Random work-stealing schedulers have been shown to perform dynamic load balancing effectively, yielding provably efficient schedules and space bounds on shared-memory architectures with uniform memory models. However, the advent of hierarchical, non-uniform multicore systems and large-scale distributed-memory architectures has reduced the efficacy of these scheduling policies. Furthermore, random work-stealing schedulers do not exploit persistence within iterative scientific applications. In this thesis, we prove several properties of work-stealing schedulers that enable online tracing of tasks with very low overhead. We then describe new scheduling policies that use online schedule introspection to understand scheduler placement and thus improve performance on NUMA and distributed-memory architectures. Finally, by incorporating an inclusive data-effect system into fork-join programs with schedule placement knowledge, we show how a fork-join program can be transformed to significantly improve locality.
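    A single-threaded toy simulation of random work stealing (illustrative only, not the thesis's scheduler or runtime) shows the two defining operations the thesis builds on: a worker pops the newest task from the bottom of its own deque, which preserves locality, while an idle worker steals the oldest task from the top of a random victim's deque.

```python
# Toy random work-stealing simulation; placement and victim choice are the
# knobs that NUMA- and locality-aware policies would tune.
import random
from collections import deque

def work_steal(tasks, n_workers=4, seed=0):
    rng = random.Random(seed)
    queues = [deque() for _ in range(n_workers)]
    for task in tasks:
        queues[0].append(task)              # skewed placement forces stealing
    results, steals = [], 0
    while any(queues):
        for w, q in enumerate(queues):
            if not q:                       # idle worker: pick a random victim
                victims = [v for v in range(n_workers) if v != w and queues[v]]
                if not victims:
                    continue
                q.append(queues[rng.choice(victims)].popleft())  # steal oldest
                steals += 1
            results.append(q.pop()())       # run newest local task (LIFO)
    return results, steals

results, steals = work_steal([lambda i=i: i * i for i in range(16)])
print(sorted(results), "steals:", steals)
```

    Because the victim here is chosen uniformly at random, the simulation ignores placement entirely; the scheduling policies described above would instead bias victim selection using the traced schedule.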