33 research outputs found

    Optimization of condensed matter physics application with OpenMP tasking model

    Get PDF
    The Density Matrix Renormalization Group (DMRG++) is a condensed matter physics application used to study superconductivity properties of materials. It’s main computations consist of calculating hamiltonian matrix which requires sparse matrix-vector multiplications. This paper presents task-based parallelization and optimization strategies of the Hamiltonian algorithm. The algorithm is implemented as a mini-application in C++ and parallelized with OpenMP. The optimization leverages tasking features, such as dependencies or priorities included in the OpenMP standard 4.5. The code refactoring targets performance as much as programmability. The optimized version achieves a speedup of 8.0 × with 8 threads and 20.5 × with 40 threads on a Power9 computing node while reducing the memory consumption to 90 MB with respect to the original code, by adding less than ten OpenMP directives.This work is partially supported by the Spanish Government through Programa Severo Ochoa (SEV2015-0493), by the Spanish Ministry of Science and Technology (project TIN2015-65316-P), by the Generalitat de Catalunya (contract 2017-SGR-1414) and by the BSC-IBM Deep Learning Research Agreement, under JSA “Application porting, analysis and optimization for POWER and POWER AI”. This work was partially supported by the Scientific Discovery through Advanced Computing (SciDAC) program funded by U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research and Basic Energy Sciences, Division of Materials Sciences and Engineering. This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.Peer ReviewedPostprint (author's final draft

    Role-shifting threads: Increasing OpenMP malleability to address load imbalance at MPI and OpenMP

    Get PDF
    This paper presents the evolution of the free agent threads for OpenMP to the new role-shifting threads model and their integration with the Dynamic Load Balancing (DLB) library. We demonstrate how free agent threads can improve resource utilization in OpenMP applications with load imbalance in their nested parallel regions. We also demonstrate how DLB efficiently manages the malleability exposed by the role-shifting threads to address load imbalance issues. We use three real-world scientific applications, one of them to demonstrate that free agents alone can improve the OpenMP model without external tools, and two other MPI+OpenMP applications, one of them with a coupling case, to illustrate the potential of the free agent threads’ malleability with an external resource manager to increase the efficiency of the system. In addition, we demonstrate that the new implementation is more usable than the former one, letting the runtime system automatically make decisions that were made by the programmer previously. All software is released open-source.This work has received funding from the DEEP Projects, at the European Commission’s FP7, H2020, and EuroHPC Programmes, under Grant Agreements 287530, 610476, 754304, and 955606. The PCI2021-121958 financed by the Spanish State Research Agency - Ministry of Science and Innovation. And it also has the support of the Spanish Ministry of Science and Innovation (Computacion de Altas Prestaciones VIII: PID2019-107255GB).Peer ReviewedPostprint (author's final draft

    Performance analysis and optimization of the FFTXlib on the Intel knights landing architecture

    Get PDF
    In this paper, we address the decreasing performance of the FFTXlib, the Fast Fourier Transformation (FFT) kernel of Quantum ESPRESSO, when scaling to a full KNL node. An increased performance in the FFTXlib will likewise increase the performance of the entire Quantum ESPRESSO code one of the most used plane-wave DFT codes in the community of material science. Our approach focuses on, first, overlapping computation and communication and, second, decreasing resource contention for higher compute efficiency. In order to achieve this we use the OmpSs programming model based on task dependencies. We allow overlapping of computation and communication by converting all steps of the FFT into tasks following a flow dependency. In the same way, we decrease resource contention by converting each FFT into an individual task that can be scheduled asynchronously. In both cases, multiple FFTs can be computed in parallel. The task-based optimizations are implemented in the FFTXlib and show up to 10% runtime reduction on the already highly optimized version. Since the task scheduling is done dynamically during execution by the parallel runtime, not statically by the user, it also frees the user from finding the ideal parallel configuration himself.We gratefully acknowledge the support of the MaX and POP projects, which have received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 676598 and 676553, respectively.Peer ReviewedPostprint (author's final draft

    Quantification of 3D spatial correlations between state variables and distances to the grain boundary network in full-field crystal plasticity spectral method simulations

    No full text
    Deformation microstructure heterogeneities play a pivotal role during dislocation patterning and interface network restructuring. Thus, they affect indirectly how an alloy recrystallizes if at all. Given this relevance, it has become common practice to study the evolution of deformation microstructure heterogeneities with 3D experiments and full-field crystal plasticity computer simulations including tools such as the spectral method. Quantifying material point to grain or phase boundary distances, though, is a practical challenge with spectral method crystal plasticity models because these discretize the material volume rather than mesh explicitly the grain and phase boundary interface network. This limitation calls for the development of interface reconstruction algorithms which enable us to develop specific data post-processing protocols to quantify spatial correlations between state variable values at each material point and the points' corresponding distance to the closest grain or phase boundary. This work contributes to advance such post-processing routines. Specifically, two grain reconstruction and three distancing methods are developed to solve above challenge. The individual strengths and limitations of these methods surplus the efficiency of their parallel implementation is assessed with an exemplary DAMASK large scale crystal plasticity study. We apply the new tool to assess the evolution of subtle stress and disorientation gradients towards grain boundaries.Comment: Manuscript submitted to Modelling and Simulation in Materials Science and Engineerin

    The LAPW method with eigendecomposition based on the Hari--Zimmermann generalized hyperbolic SVD

    Full text link
    In this paper we propose an accurate, highly parallel algorithm for the generalized eigendecomposition of a matrix pair (H,S)(H, S), given in a factored form (FJF,GG)(F^{\ast} J F, G^{\ast} G). Matrices HH and SS are generally complex and Hermitian, and SS is positive definite. This type of matrices emerges from the representation of the Hamiltonian of a quantum mechanical system in terms of an overcomplete set of basis functions. This expansion is part of a class of models within the broad field of Density Functional Theory, which is considered the golden standard in condensed matter physics. The overall algorithm consists of four phases, the second and the fourth being optional, where the two last phases are computation of the generalized hyperbolic SVD of a complex matrix pair (F,G)(F,G), according to a given matrix JJ defining the hyperbolic scalar product. If J=IJ = I, then these two phases compute the GSVD in parallel very accurately and efficiently.Comment: The supplementary material is available at https://web.math.pmf.unizg.hr/mfbda/papers/sm-SISC.pdf due to its size. This revised manuscript is currently being considered for publicatio

    Performance analysis and optimization of an HPC application: DMRG++

    Get PDF
    DMRG++ (Density Matrix Renormalization Group) és una aplicació de física de la matèria condensada orientada a HPC, originalment desenvolupada per l'Oak Ridge National Laboratory (ORNL). En aquest projecte es treballarà en la millora de la part de càlcul intensiu de l'aplicació, fent ús d'una miniapp que encapsula aquesta secció crítica. Partint d'una implementació inicial amb OpenMP basada en diversos parallel for aniuats, s'exploraran diferents alternatives per millorar el temps d'execució i el consum de memòria mitjançant el model de tasques amb dependències d'OpenMP, tot fent servir una estratègia d'anàlisi de l'aplicació i de desenvolupament iterativa. D'aquesta manera, no només esperem contribuir amb la millora d'una aplicació científica, sinó també mostrar tècniques d'anàlisi efectives i estratègies de paral·lelització per a aplicacions amb distribucions de feina molt desiguals.DMRG++ (Density Matrix Renormalization Group) is a condensed matter physics application oriented to HPC, developed by Oak Ridge National Laboratory (ORNL). In this project, we will focus on improving the intensive arithmetic kernel of the application, using a miniapp that encapsulates this critical program part. Starting with an initial implementation with OpenMP, which uses several nested parallel for, we will explore different alternatives to improve its execution time and memory consumption through OpenMP task dependency model, taking advantage of an iterative strategy of in-depth application analysis and development. In this way, we are not just contributing by improving a scientific application, but also showing effective analysis techniques and best practices for programmability and parallelization focused on applications with irregular workloads
    corecore