33 research outputs found
Optimization of condensed matter physics application with OpenMP tasking model
The Density Matrix Renormalization Group (DMRG++) is a condensed matter physics application used to study the superconductivity properties of materials. Its main computation consists of calculating the Hamiltonian matrix, which requires sparse matrix-vector multiplications. This paper presents task-based parallelization and optimization strategies for the Hamiltonian algorithm. The algorithm is implemented as a mini-application in C++ and parallelized with OpenMP. The optimization leverages tasking features, such as dependencies and priorities, included in the OpenMP 4.5 standard. The code refactoring targets performance as much as programmability. The optimized version achieves a speedup of 8.0× with 8 threads and 20.5× with 40 threads on a Power9 computing node while reducing the memory consumption to 90 MB with respect to the original code, by adding fewer than ten OpenMP directives. This work is partially supported by the Spanish Government through Programa Severo Ochoa (SEV2015-0493), by the Spanish Ministry of Science and Technology (project TIN2015-65316-P), by the Generalitat de Catalunya (contract 2017-SGR-1414) and by the BSC-IBM Deep Learning Research Agreement, under JSA “Application porting, analysis and optimization for POWER and POWER AI”. This work was partially supported by the Scientific Discovery through Advanced Computing (SciDAC) program funded by U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research and Basic Energy Sciences, Division of Materials Sciences and Engineering. This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. Peer Reviewed. Postprint (author's final draft)
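The tasking features named in the abstract above (dependencies and priorities) can be illustrated with a minimal sketch. This is not the authors' code: the `Block` struct, the function names, and the block-row layout are hypothetical, chosen only to show one OpenMP task per matrix block, a `depend` clause that serializes updates to the same output segment, and a `priority` hint that favours larger blocks.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical dense block of a larger sparse matrix (names are illustrative).
struct Block {
    int row0, col0;         // top-left corner in the global matrix
    int rows, cols;         // block dimensions
    std::vector<double> a;  // row-major values, size rows * cols
};

// y += A * x with one OpenMP task per block.
// Assumption: blocks that touch the same rows share the same row0 (a block-row
// partition), so the dependency on the first element of the output segment
// serializes exactly the tasks that would otherwise race on y.
void block_spmv(const std::vector<Block>& blocks,
                const std::vector<double>& x, std::vector<double>& y) {
    const double* xp = x.data();
    double* yp = y.data();
    #pragma omp parallel
    #pragma omp single
    {
        for (std::size_t b = 0; b < blocks.size(); ++b) {
            const Block* blk = &blocks[b];
            assert(blk->a.size() ==
                   static_cast<std::size_t>(blk->rows) * blk->cols);
            double* yseg = yp + blk->row0;
            // Larger blocks get a higher priority; the runtime treats it as a hint.
            const int prio = blk->rows * blk->cols;
            #pragma omp task depend(inout: yseg[0]) priority(prio) firstprivate(blk, yseg, xp)
            for (int i = 0; i < blk->rows; ++i) {
                double acc = 0.0;
                for (int j = 0; j < blk->cols; ++j)
                    acc += blk->a[i * blk->cols + j] * xp[blk->col0 + j];
                yseg[i] += acc;
            }
        }
    }  // all tasks have completed at the barrier that ends the single region
}

// Small 4x4 example built from three 2x2 blocks, multiplied by a vector of ones.
std::vector<double> demo_block_spmv() {
    std::vector<Block> blocks = {
        {0, 0, 2, 2, {1.0, 2.0, 3.0, 4.0}},
        {0, 2, 2, 2, {1.0, 0.0, 0.0, 1.0}},
        {2, 2, 2, 2, {2.0, 0.0, 0.0, 2.0}},
    };
    std::vector<double> x(4, 1.0), y(4, 0.0);
    block_spmv(blocks, x, y);
    return y;
}
```

Compiled without OpenMP support the pragmas are ignored and the loop runs serially with the same result; with OpenMP enabled, the first two blocks are ordered by their shared dependency while the third may run concurrently.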
Role-shifting threads: Increasing OpenMP malleability to address load imbalance at MPI and OpenMP
This paper presents the evolution of the free agent threads for OpenMP to the new role-shifting threads model and
their integration with the Dynamic Load Balancing (DLB) library. We demonstrate how free agent threads can improve
resource utilization in OpenMP applications with load imbalance in their nested parallel regions. We also demonstrate
how DLB efficiently manages the malleability exposed by the role-shifting threads to address load imbalance issues.
We use three real-world scientific applications, one of them to demonstrate that free agents alone can improve the
OpenMP model without external tools, and two other MPI+OpenMP applications, one of them with a coupling case, to
illustrate the potential of the free agent threads’ malleability with an external resource manager to increase the efficiency
of the system. In addition, we demonstrate that the new implementation is more usable than the former one, letting the
runtime system automatically make decisions that were made by the programmer previously. All software is released
open-source. This work has received funding from the DEEP
Projects, at the European Commission’s FP7, H2020, and EuroHPC
Programmes, under Grant Agreements 287530, 610476, 754304, and
955606. It is also financed by project PCI2021-121958 of the Spanish State Research
Agency - Ministry of Science and Innovation, and has the support
of the Spanish Ministry of Science and Innovation (Computacion de Altas
Prestaciones VIII: PID2019-107255GB). Peer Reviewed. Postprint (author's final draft)
Performance analysis and optimization of the FFTXlib on the Intel Knights Landing architecture
In this paper, we address the decreasing performance of the FFTXlib, the Fast Fourier Transform (FFT) kernel of Quantum ESPRESSO, when scaling to a full KNL node. An increased performance in the FFTXlib will likewise increase the performance of the entire Quantum ESPRESSO code, one of the most used plane-wave DFT codes in the materials science community. Our approach focuses on, first, overlapping computation and communication and, second, decreasing resource contention for higher compute efficiency. To achieve this, we use the OmpSs programming model based on task dependencies. We allow overlapping of computation and communication by converting all steps of the FFT into tasks following a flow dependency. In the same way, we decrease resource contention by converting each FFT into an individual task that can be scheduled asynchronously. In both cases, multiple FFTs can be computed in parallel. The task-based optimizations are implemented in the FFTXlib and show up to 10% runtime reduction on the already highly optimized version. Since the task scheduling is done dynamically during execution by the parallel runtime, not statically by the user, it also frees the user from finding the ideal parallel configuration manually. We gratefully acknowledge the support of the MaX and POP projects, which have received funding from the European Union’s Horizon 2020 research and innovation programme
under grant agreements No. 676598 and 676553, respectively. Peer Reviewed. Postprint (author's final draft)
Quantification of 3D spatial correlations between state variables and distances to the grain boundary network in full-field crystal plasticity spectral method simulations
Deformation microstructure heterogeneities play a pivotal role during
dislocation patterning and interface network restructuring. Thus, they indirectly
affect how an alloy recrystallizes, if at all. Given this relevance, it has
become common practice to study the evolution of deformation microstructure
heterogeneities with 3D experiments and full-field crystal plasticity computer
simulations including tools such as the spectral method.
Quantifying material point to grain or phase boundary distances, though, is a
practical challenge with spectral method crystal plasticity models because
these discretize the material volume rather than explicitly mesh the grain and
phase boundary interface network. This limitation calls for the development of
interface reconstruction algorithms which enable us to develop specific data
post-processing protocols to quantify spatial correlations between state
variable values at each material point and the points' corresponding distance
to the closest grain or phase boundary.
This work contributes to advancing such post-processing routines. Specifically,
two grain reconstruction and three distancing methods are developed to solve
the above challenge. The individual strengths and limitations of these methods,
as well as the efficiency of their parallel implementation, are assessed with an
exemplary DAMASK large-scale crystal plasticity study. We apply the new tool to
assess the evolution of subtle stress and disorientation gradients towards
grain boundaries. Comment: Manuscript submitted to Modelling and Simulation in Materials Science
and Engineering
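The distance problem described in this abstract can be illustrated on a voxel grid. The sketch below is a simple brute-force baseline, not one of the paper's reconstruction or distancing methods: it assumes a non-periodic grid of grain IDs, treats a voxel as a boundary voxel when a face neighbor belongs to another grain, and measures Euclidean distance between voxel centres in voxel units. All type and function names are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

struct Voxel { int x, y, z; };

// Hypothetical regular grid of grain IDs, x index fastest.
struct Grid {
    int nx, ny, nz;
    std::vector<int> grain;
    std::size_t index(int x, int y, int z) const {
        return (static_cast<std::size_t>(z) * ny + y) * nx + x;
    }
    int at(int x, int y, int z) const { return grain[index(x, y, z)]; }
};

// A voxel is a boundary voxel if any of its six face neighbors has another grain ID.
std::vector<Voxel> boundary_voxels(const Grid& g) {
    static const int step[6][3] = {{1, 0, 0}, {-1, 0, 0}, {0, 1, 0},
                                   {0, -1, 0}, {0, 0, 1}, {0, 0, -1}};
    std::vector<Voxel> out;
    for (int z = 0; z < g.nz; ++z)
        for (int y = 0; y < g.ny; ++y)
            for (int x = 0; x < g.nx; ++x)
                for (const auto& s : step) {
                    const int X = x + s[0], Y = y + s[1], Z = z + s[2];
                    if (X < 0 || X >= g.nx || Y < 0 || Y >= g.ny ||
                        Z < 0 || Z >= g.nz)
                        continue;  // non-periodic grid: skip outside neighbors
                    if (g.at(X, Y, Z) != g.at(x, y, z)) {
                        out.push_back({x, y, z});
                        break;
                    }
                }
    return out;
}

// Distance from every voxel to the nearest boundary voxel (infinity if none exists).
std::vector<double> distance_to_boundary(const Grid& g) {
    const std::vector<Voxel> b = boundary_voxels(g);
    std::vector<double> dist(g.grain.size());
    #pragma omp parallel for
    for (int z = 0; z < g.nz; ++z)
        for (int y = 0; y < g.ny; ++y)
            for (int x = 0; x < g.nx; ++x) {
                double best = std::numeric_limits<double>::infinity();
                for (const Voxel& v : b) {
                    const double dx = x - v.x, dy = y - v.y, dz = z - v.z;
                    best = std::min(best, dx * dx + dy * dy + dz * dz);
                }
                dist[g.index(x, y, z)] = std::sqrt(best);
            }
    return dist;
}

// Two grains of three voxels each along x: the interface lies between x = 2 and x = 3.
std::vector<double> demo_distances() {
    Grid g{6, 1, 1, {0, 0, 0, 1, 1, 1}};
    return distance_to_boundary(g);
}
```

The per-voxel distances can then be binned against any state variable stored at the same material points. The cost grows with the product of voxel and boundary-voxel counts, so this form only suits small grids.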
The LAPW method with eigendecomposition based on the Hari--Zimmermann generalized hyperbolic SVD
In this paper we propose an accurate, highly parallel algorithm for the
generalized eigendecomposition of a matrix pair $(H, S)$, given in a factored
form $(F^{\ast} J F, G^{\ast} G)$. Matrices $H$ and $S$ are generally complex
and Hermitian, and $S$ is positive definite. This type of matrices emerges from
the representation of the Hamiltonian of a quantum mechanical system in terms
of an overcomplete set of basis functions. This expansion is part of a class of
models within the broad field of Density Functional Theory, which is considered
the gold standard in condensed matter physics. The overall algorithm consists
of four phases, the second and the fourth being optional, where the last two
phases compute the generalized hyperbolic SVD of a complex matrix
pair $(F, G)$, according to a given matrix $J$ defining the hyperbolic scalar
product. If $J = I$, then these two phases compute the GSVD in parallel very
accurately and efficiently. Comment: The supplementary material is available at
https://web.math.pmf.unizg.hr/mfbda/papers/sm-SISC.pdf due to its size. This
revised manuscript is currently being considered for publication
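Why a generalized hyperbolic SVD delivers the generalized eigendecomposition can be seen in a short derivation. This is a sketch under stated assumptions, not the paper's exact formulation: it assumes the pair is $(H, S)$ with $H = F^{\ast} J F$ and $S = G^{\ast} G$, that $F$ and $G$ are square and nonsingular, that $J$ is a diagonal matrix with entries $\pm 1$, and that a decomposition of the form in the first line exists with real positive diagonal $\Sigma_F$ and $\Sigma_G$. The symbols $U$, $V$, $X$, $\Sigma_F$, $\Sigma_G$ and $\Lambda$ are introduced here for illustration only.

```latex
\begin{align*}
F &= U \Sigma_F X, \quad G = V \Sigma_G X, \quad U^{\ast} J U = J, \quad V^{\ast} V = I, \\
H &= F^{\ast} J F = X^{\ast} \Sigma_F U^{\ast} J U \Sigma_F X = X^{\ast} \left( J \Sigma_F^{2} \right) X, \\
S &= G^{\ast} G = X^{\ast} \Sigma_G V^{\ast} V \Sigma_G X = X^{\ast} \Sigma_G^{2} X, \\
H X^{-1} &= S X^{-1} \Lambda, \quad \Lambda = J \Sigma_F^{2} \Sigma_G^{-2}.
\end{align*}
```

Under these assumptions the columns of $X^{-1}$ are the generalized eigenvectors and the diagonal of $\Lambda$ holds the eigenvalues, with signs supplied by $J$. For $J = I$ the matrix $U$ is unitary and the first line is the ordinary GSVD of $(F, G)$.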
Performance analysis and optimization of an HPC application: DMRG++
DMRG++ (Density Matrix Renormalization Group) is an HPC-oriented condensed matter physics application, originally developed by Oak Ridge National Laboratory (ORNL). In this project, we will focus on improving the compute-intensive kernel of the application, using a miniapp that encapsulates this critical section. Starting from an initial OpenMP implementation based on several nested parallel for constructs, we will explore different alternatives to improve its execution time and memory consumption through the OpenMP task dependency model, following an iterative strategy of in-depth application analysis and development. In this way, we expect not only to contribute to improving a scientific application, but also to show effective analysis techniques and best practices for programmability and parallelization in applications with irregular workloads.