
    Exascale Sparse Eigensolver Developments for Quantum Physics Applications

    In the German Research Foundation (DFG) project ESSEX (Equipping Sparse Solvers for Exascale), we develop scalable sparse eigensolver libraries for large quantum physics problems. Partners in ESSEX are the Universities of Erlangen, Greifswald, Wuppertal, Tokyo and Tsukuba as well as DLR. The project pursues a coherent co-design of all software layers, where a holistic performance engineering process guides code development across the classic boundaries of application, numerical method and basic kernel library. The basic building block library supports an elaborate MPI+X approach that fully exploits hardware heterogeneity while exposing functional and data parallelism to all other software layers in a flexible way. The advanced building blocks were defined and employed by the developments at the algorithms layer. Here, ESSEX provides state-of-the-art library implementations of classic linear sparse eigenvalue solvers, including block Jacobi-Davidson, the Kernel Polynomial Method (KPM), and Chebyshev filter diagonalization (ChebFD), which are ready for production use on modern heterogeneous compute nodes with high performance and numerical accuracy. Research in this direction included the development of appropriate parallel adaptive AMG software for the block Jacobi-Davidson method. Contour integral-based approaches were also covered in ESSEX and were extended in two directions: the FEAST method was further developed for improved scalability, and the Sakurai-Sugiura method (SSM) was extended to nonlinear sparse eigenvalue problems. These developments were strongly supported by the Japanese project partners from the University of Tokyo (Computer Science) and the University of Tsukuba (Applied Mathematics). The applications layer delivers scalable solutions for conservative (Hermitian) and dissipative (non-Hermitian) quantum systems with strong links to optics and biology, and to novel materials such as graphene and topological insulators.
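    For illustration, a minimal Python/SciPy sketch of the kind of kernel that KPM and ChebFD are built on is shown below: a Chebyshev three-term recurrence applied to a block of vectors. This is not ESSEX library code; it assumes the spectrum of H has already been scaled to [-1, 1] and that coeffs holds the expansion coefficients of the target filter (e.g. with Jackson damping).

        import numpy as np
        import scipy.sparse as sp

        def chebyshev_filter(H, X, coeffs):
            """Apply the polynomial filter p(H) X via the Chebyshev recurrence
            T_0(H) = I, T_1(H) = H, T_{k+1}(H) = 2 H T_k(H) - T_{k-1}(H)."""
            T_prev, T_curr = X, H @ X
            Y = coeffs[0] * T_prev + coeffs[1] * T_curr
            for c in coeffs[2:]:
                T_prev, T_curr = T_curr, 2.0 * (H @ T_curr) - T_prev
                Y += c * T_curr
            return Y

        # toy usage: a sparse H whose spectrum already lies in [-1, 1]
        H = sp.diags(np.linspace(-0.9, 0.9, 2000), format='csr')
        X = np.random.rand(2000, 8)
        Y = chebyshev_filter(H, X, np.array([0.5, 0.3, 0.1, 0.05]))

    In the ESSEX libraries the corresponding sparse matrix-multiple-vector kernels are fused and run heterogeneously across CPUs and accelerators; the sketch only shows the numerical recurrence.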

    Equipping Sparse Solvers for Exascale - A Survey of the DFG Project ESSEX

    The ESSEX project investigates computational issues arising at exascale for large-scale sparse eigenvalue problems and develops programming concepts and numerical methods for their solution. The project pursues a coherent co-design of all software layers, where a holistic performance engineering process guides code development across the classic boundaries of application, numerical method and basic kernel library. Within ESSEX the numerical methods cover both widely applicable solvers, such as classic Krylov, Jacobi-Davidson or recent FEAST methods, and domain-specific iterative schemes relevant for the ESSEX quantum physics application. This presentation introduces the project structure and presents selected results which demonstrate the potential impact of ESSEX for efficient sparse solvers on highly scalable heterogeneous supercomputers. In the second project phase from 2016 to 2018, the ESSEX consortium will include partners from the Universities of Tokyo and Tsukuba. Extensions of the existing work will address numerically reliable computing methods, scalability improvements by leveraging functional parallelism in asynchronous preconditioners, hiding and reducing communication cost, improving load balancing by advanced partitioning schemes, as well as the treatment of non-Hermitian matrix problems.

    Scalable Multi-Resolution Streaming for the interactive Analysis of Large Simulation Data Sets

    The interactive analysis of large simulation data sets is to be enabled by scalable multi-resolution streaming. Since very large simulation data sets, such as those produced by TRACE runs, cannot be processed completely in a short time, interactive visualization methods are to be developed that
    • allow fast access to partial data through multi-resolution data structures,
    • use many-core systems for efficient, parallel postprocessing, and
    • exploit modern GPU architectures both to support the postprocessing and to render the results in high quality at interactive frame rates.
    The central goal is to keep all stages of the complete postprocessing pipeline permanently supplied with sufficient data, so that ever more detailed results are produced for the interactive analysis.

    Performance and productivity of parallel python programming: a study with a CFD test case

    The programming language Python is widely used to rapidly create compact software. However, its low performance compared to low-level programming languages like C or Fortran has prevented its use for HPC applications. Efficient parallel programming of multi-core systems and graphics cards is generally a complex task. Python with add-ons might provide a simple approach to program those systems. This paper evaluates the performance of Python implementations with different libraries and compares it to implementations in C or Fortran. As a test case from the field of computational fluid dynamics (CFD), a part of a rotor simulation code was selected. Fortran versions of this code were available for use on single-core, multi-core and graphics-card systems. For all these computer systems, multiple compact versions of the code were implemented in Python with different libraries. For performance analysis of the rotor simulation kernel, a performance model was developed. This model was then employed to assess the performance reached with the different implementations. Performance tests showed that an implementation in Python syntax is six times slower than Fortran on single-core systems. The performance on multi-core systems and graphics cards is about a tenth of that of the Fortran implementations. A higher performance was achieved by a hybrid implementation in C and Python using Cython. The latter reached about half of the performance of the Fortran implementation.
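    The kind of gap the study quantifies can be illustrated with a toy kernel (this is not the rotor simulation code, just an assumed axpy-style update for illustration): a pure-Python loop pays interpreter overhead per element, a NumPy-vectorized version runs the loop in compiled code, and the Cython route discussed in the paper essentially compiles the loop form to C.

        import time
        import numpy as np

        def axpy_loop(a, x, y):
            # Interpreted loop: one bytecode dispatch per element.
            out = np.empty_like(y)
            for i in range(len(y)):
                out[i] = a * x[i] + y[i]
            return out

        def axpy_numpy(a, x, y):
            # Vectorized: the element loop runs inside compiled NumPy code.
            return a * x + y

        n = 1_000_000
        x, y = np.random.rand(n), np.random.rand(n)
        for f in (axpy_loop, axpy_numpy):
            t0 = time.perf_counter()
            f(2.0, x, y)
            print(f.__name__, time.perf_counter() - t0, "s")

    The measured ratios in the paper refer to the full rotor simulation kernel and its performance model, not to this toy example.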

    Sparse matrix-vector multiplication on GPGPU clusters: A new storage format and a scalable implementation

    Sparse matrix-vector multiplication (spMVM) is the dominant operation in many sparse solvers. We investigate performance properties of spMVM with matrices of various sparsity patterns on the NVIDIA "Fermi" class of GPGPUs. A new "padded jagged diagonals storage" (pJDS) format is proposed which may substantially reduce the memory overhead intrinsic to the widespread ELLPACK-R scheme. In our test scenarios the pJDS format cuts the overall spMVM memory footprint on the GPGPU by up to 70%, and achieves 95% to 130% of the ELLPACK-R performance. Using a suitable performance model we identify performance bottlenecks on the node level that invalidate some types of matrix structures for efficient multi-GPGPU parallelization. For appropriate sparsity patterns we extend previous work on distributed-memory parallel spMVM to demonstrate a scalable hybrid MPI-GPGPU code, achieving efficient overlap of communication and computation.
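    The memory saving of pJDS relative to ELLPACK-R comes from sorting rows by their nonzero count and padding them only within small chunks of similar length instead of to the global maximum row length. The following CPU-side Python sketch illustrates that layout idea; it is a simplification for illustration (the chunk size, zero padding and returned row permutation are assumptions here), not the GPGPU data structure or kernel of the paper.

        import numpy as np
        import scipy.sparse as sp

        def to_pjds_like(A, chunk=32):
            """Simplified pJDS-style layout: sort rows by nonzero count, pad
            only to the longest row within each chunk, and store each chunk
            column-major ("jagged diagonals")."""
            A = sp.csr_matrix(A)
            lengths = np.diff(A.indptr)
            perm = np.argsort(-lengths, kind="stable")     # long rows first
            chunks = []
            for start in range(0, A.shape[0], chunk):
                rows = perm[start:start + chunk]
                width = int(lengths[rows].max())
                vals = np.zeros((len(rows), width))
                cols = np.zeros((len(rows), width), dtype=np.int64)
                for r, row in enumerate(rows):
                    lo, hi = A.indptr[row], A.indptr[row + 1]
                    vals[r, :hi - lo] = A.data[lo:hi]
                    cols[r, :hi - lo] = A.indices[lo:hi]
                chunks.append((vals.T.copy(), cols.T.copy(), rows))
            return chunks

        # toy usage
        layout = to_pjds_like(sp.random(1000, 1000, density=0.01, format='csr'))

    A GPU spMVM kernel would traverse each chunk column by column, which is what makes the column-major ("jagged diagonal") storage attractive for coalesced memory access.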

    Performance of Low-Rank Tensor Algorithms

    We discuss low-rank tensor algorithms and in particular algorithms for the tensor-train (TT) format (known as MPS in computational physics). We focus on the required building blocks and model their node-level performance on modern multi-core CPUs. More specifically, we consider the lossy compression of large dense data (TT-SVD), as well as linear solvers in TT format (TT-MALS, TT-GMRES). For the data compression, we derive the optimal roofline runtime for the complete algorithm based on the two main building blocks in an optimized implementation: Q-less TSQR and tall-skinny matrix-matrix multiplication. For the low-rank linear solvers, we categorize the different kinds of building blocks according to their performance characteristics and show possible performance optimizations. While all required tensor operations can in principle be mapped onto standard BLAS/LAPACK routines, faster implementations need specific performance optimizations: these include (1) avoiding costly singular-value decompositions (SVDs), (2) employing special fused operations for sequences of memory-bound tensor contractions and reshaping operations, and (3) tracking properties of tensors such as orthogonality. We show the effect of the different optimizations and compare the runtime of our implementation with other tensor libraries.
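    To make the data-compression building block concrete, here is a minimal dense, single-node TT-SVD sketch in NumPy. It illustrates the textbook algorithm only; the optimized implementations discussed above replace the plain SVDs with Q-less TSQR and fused, bandwidth-aware kernels, and the example tensor is an arbitrary choice.

        import numpy as np

        def tt_svd(tensor, max_rank):
            """Minimal TT-SVD: successive truncated SVDs of the unfoldings.
            Returns TT cores of shape (r_{k-1}, n_k, r_k)."""
            dims = tensor.shape
            cores, r_prev = [], 1
            rest = np.asarray(tensor)
            for n in dims[:-1]:
                rest = rest.reshape(r_prev * n, -1)
                U, s, Vt = np.linalg.svd(rest, full_matrices=False)
                r = min(max_rank, len(s))
                cores.append(U[:, :r].reshape(r_prev, n, r))
                rest = s[:r, None] * Vt[:r]      # carry the remainder to the right
                r_prev = r
            cores.append(rest.reshape(r_prev, dims[-1], 1))
            return cores

        # toy usage: compress a rank-1 16x16x16x16 tensor
        A = np.einsum('i,j,k,l->ijkl', *(np.random.rand(16) for _ in range(4)))
        cores = tt_svd(A, max_rank=4)

    Contracting the returned cores reconstructs the tensor up to the truncation error controlled by max_rank.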

    Performance of high-order SVD approximation: reading the data twice is enough

    This talk considers the problem of calculating a low-rank tensor approximation of some large dense data. We focus on the tensor train SVD (TT-SVD), but the approach can be transferred to other low-rank tensor formats such as general tree tensor networks. In the TT-SVD algorithm, the dominant building block consists of singular value decompositions of tall-skinny matrices. Therefore, the computational performance is bound by data transfers on current hardware as long as the desired tensor ranks are sufficiently small. Based on a simple roofline performance model we show that under reasonable assumptions the minimal runtime is of the order of reading the data twice. We present an almost optimal, distributed parallel implementation that is based on a specialized rank-preserving TSQR step. Moreover, we discuss important algorithmic details and compare our results with common implementations that are often about 50x slower than optimal.
    References:
    • Oseledets: "Tensor-Train Decomposition", SISC 2011
    • Grasedyck and Hackbusch: "An Introduction to Hierarchical (H-) Rank and TT-Rank of Tensors with Examples", CMAM 2011
    • Demmel et al.: "Communication Avoiding Rank Revealing QR Factorization with Column Pivoting", SIMAX 2015
    • Williams et al.: "Roofline: An Insightful Visual Performance Model for Multicore Architectures", CACM 2009
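    The TSQR building block mentioned above can be sketched in a few lines of serial NumPy: each tall-skinny row block is QR-factorized, and only the small triangular factors are reduced further, so the large data is read once per pass. The distributed, rank-preserving variant from the talk additionally truncates small singular values and performs the reduction across MPI ranks; the block and matrix sizes below are made up for illustration.

        import numpy as np

        def qless_tsqr(row_blocks):
            """Q-less TSQR: QR-factorize each row block, stack the small R
            factors and reduce them with one more QR; only R is returned."""
            Rs = [np.linalg.qr(block, mode='r') for block in row_blocks]
            return np.linalg.qr(np.vstack(Rs), mode='r')

        # example: a 1,000,000 x 16 matrix split into 8 row blocks
        X = np.random.rand(1_000_000, 16)
        R = qless_tsqr(np.array_split(X, 8))   # R^T R equals X^T X up to rounding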

    GHOST: Building blocks for high performance sparse linear algebra on heterogeneous systems

    While many of the architectural details of future exascale-class high performance computer systems are still a matter of intense research, there appears to be a general consensus that they will be strongly heterogeneous, featuring "standard" as well as "accelerated" resources. Today, such resources are available as multicore processors, graphics processing units (GPUs), and other accelerators such as the Intel Xeon Phi. Any software infrastructure that claims usefulness for such environments must be able to meet their inherent challenges: massive multi-level parallelism, topology, asynchronicity, and abstraction. The "General, Hybrid, and Optimized Sparse Toolkit" (GHOST) is a collection of building blocks that targets algorithms dealing with sparse matrix representations on current and future large-scale systems. It implements the "MPI+X" paradigm, has a pure C interface, and provides hybrid-parallel numerical kernels, intelligent resource management, and truly heterogeneous parallelism for multicore CPUs, Nvidia GPUs, and the Intel Xeon Phi. We describe the details of its design with respect to the challenges posed by modern heterogeneous supercomputers and recent algorithmic developments. Implementation details which are indispensable for achieving high efficiency are pointed out, and their necessity is justified by performance measurements or predictions based on performance models. The library code and several applications are available as open source. We also provide instructions on how to make use of GHOST in existing software packages, together with a case study which demonstrates the applicability and performance of GHOST as a component within a larger software stack.
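    A central pattern behind heterogeneous distributed spMVM of the kind GHOST provides is overlapping the halo exchange of the input vector with computation on the purely local matrix part. The sketch below shows that generic MPI+X pattern with mpi4py and NumPy; it is not GHOST's C API, and the local/remote matrix split, the halo bookkeeping and all names are assumptions for illustration.

        from mpi4py import MPI
        import numpy as np

        def spmvm_overlapped(A_local, A_remote, x_local, halo, comm, neighbours):
            """y = A x with communication/computation overlap.
            A_local  : sparse block needing only locally owned entries of x
            A_remote : sparse block whose columns are the received halo entries,
                       ordered by neighbour rank to match the concatenation below
            halo     : dict neighbour rank -> (send_indices, recv_buffer)
            """
            reqs, send_bufs = [], []
            for rank in neighbours:
                send_idx, recv_buf = halo[rank]
                send_bufs.append(np.ascontiguousarray(x_local[send_idx]))
                reqs.append(comm.Isend(send_bufs[-1], dest=rank))
                reqs.append(comm.Irecv(recv_buf, source=rank))

            y = A_local @ x_local              # overlap: local work while messages fly
            MPI.Request.Waitall(reqs)          # all halo entries have arrived
            x_halo = np.concatenate([halo[r][1] for r in neighbours])
            return y + A_remote @ x_halo

    GHOST handles this split together with resource management (thread pinning, accelerator offload, communication) internally; the sketch only illustrates the overlap idea.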