17 research outputs found

    Automated Scheduling Algorithm Selection and Chunk Parameter Calculation in OpenMP

    Increasing node and cores-per-node counts in supercomputers render scheduling and load balancing critical for exploiting parallelism. OpenMP applications can achieve high performance via careful selection of scheduling kind and chunk parameters on a per-loop, per-application, and per-system basis from a portfolio of advanced scheduling algorithms (Korndörfer et al., 2022). This selection approach is time-consuming, challenging, and may need to change during execution. We propose Auto4OMP, a novel approach for automated load balancing of OpenMP applications. With Auto4OMP, we introduce three scheduling algorithm selection methods and an expert-defined chunk parameter for the kind and chunk arguments of OpenMP's schedule clause, respectively. Auto4OMP extends the OpenMP schedule(auto) and chunk parameter implementation in LLVM's OpenMP runtime library to automatically select a scheduling algorithm and calculate a chunk parameter during execution. Loop characteristics are inferred in Auto4OMP from the loop execution over the application's time-steps. The experiments performed in this work show that Auto4OMP improves application performance by up to 11% compared to LLVM's schedule(auto) implementation and outperforms manual selection. Auto4OMP improves the performance of MPI+OpenMP applications by explicitly minimizing thread-load imbalance and implicitly reducing process-load imbalance.

    MLS: Multilevel Scheduling in Large Scale High Performance

    High performance computing systems are of increased size (in terms of node count, core count, and core types per node), resulting in increased available hardware parallelism. Hardware parallelism can be found at several levels, from machine instructions to global computing sites. Unfortunately, exposing, expressing, and exploiting parallelism is difficult when considering the increase in parallelism within each level and when exploiting more than a single or even a couple of parallelism levels. The multilevel scheduling (MLS) project aims to offer an answer to the following research question: Given massive parallelism, at multiple levels, and of diverse forms and granularities, how can it be exposed, expressed, and exploited such that execution times are reduced, performance targets are achieved, and acceptable efficiency is maintained? The MLS project investigates the development of a multilevel approach for achieving scalable scheduling in large scale high performance computing systems across the multiple levels of parallelism, with a focus on software parallelism. By integrating multiple levels of parallelism, MLS differs from hierarchical scheduling, traditionally employed to achieve scalability within a single level of parallelism. Specifically, MLS extends and bridges the most successful (batch, application, and thread) scheduling models beyond a single or a couple of parallelism levels (scaling across) and beyond their current scale (scaling out). Via the MLS approach, the project aims to leverage all available parallelism and address hardware heterogeneity in large scale high performance computers such that execution times are reduced, performance targets are achieved, and acceptable efficiency is maintained.

    Finding Neighbors in a Forest: A b-tree for Smoothed Particle Hydrodynamics Simulations

    Finding the exact close neighbors of each fluid element in mesh-free computational hydrodynamical methods, such as the Smoothed Particle Hydrodynamics (SPH), often becomes a main bottleneck for scaling their performance beyond a few million fluid elements per computing node. Tree structures are particularly suitable for SPH simulation codes, which rely on finding the exact close neighbors of each fluid element (or SPH particle). In this work we present a novel tree structure, named b-tree, which features an adaptive branching factor to reduce the depth of the neighbor search. Depending on the particle spatial distribution, finding neighbors using b-tree has an asymptotic best case complexity of O(n), as opposed to O(n log n) for other classical tree structures such as octrees and quadtrees. We also present the proposed tree structure as well as the algorithms to build it and to find the exact close neighbors of all particles. We assess the scalability of the proposed tree-based algorithms through an extensive set of performance experiments in a shared-memory system. Results show that b-tree is up to 12× faster for building the tree and up to 1.6× faster for finding the exact neighbors of all particles when compared to its octree form. Moreover, we apply b-tree to a SPH code and show its usefulness over the existing octree implementation, where b-tree is up to 5× faster for finding the exact close neighbors compared to the legacy code.

    Mapping Matters: Application Process Mapping on 3-D Processor Topologies

    Applications’ performance is influenced by the mapping of processes to computing nodes, the frequency and volume of exchanges among processing elements, the network capacity, and the routing protocol. A poor mapping of application processes degrades performance and wastes resources. As process mapping is frequently ignored as an explicit optimization step (since the system typically offers a default mapping), users may lack awareness of their applications’ communication behavior, and the opportunities for improving performance through mapping are often unclear. This work studies the impact of application process mapping on several processor topologies. We propose and apply a generic workflow that renders mapping as an explicit optimization step to a set of four applications, twelve mapping algorithms, and three direct network topologies. We assess the mappings’ quality in terms of volume, frequency, and distance of exchanges using metrics such as dilation (measured in hop·Byte). With a parallel trace-based simulator, we predict the applications’ execution on the three topologies using the twelve mappings. To ensure the correctness of the simulations, we compare the pre- and post-simulation results. This work emphasizes the importance of process mapping as an explicit optimization step and offers a solution for parallel applications to exploit the full potential of the allocated resources on a given system.
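
    To make the hop·Byte unit in the abstract above concrete, the following is a minimal sketch of a dilation-style metric. It assumes a 3-D mesh with Manhattan-distance routing and defines the cost as the sum, over all communicating process pairs, of bytes exchanged times hops travelled. The function names, the traffic pattern, and the two mappings are hypothetical illustrations, and the paper's exact definition may differ.

```python
def hops_3d_mesh(a, b):
    """Hop count between two (x, y, z) nodes on a 3-D mesh (Manhattan distance, assumed routing)."""
    return sum(abs(p - q) for p, q in zip(a, b))


def dilation_hop_bytes(traffic, mapping, hops=hops_3d_mesh):
    """Sum of (bytes exchanged) * (hops travelled) over all communicating process pairs."""
    return sum(
        nbytes * hops(mapping[src], mapping[dst])
        for (src, dst), nbytes in traffic.items()
    )


# Hypothetical traffic: a 4-process ring, each process sends 1000 bytes to its successor.
traffic = {(0, 1): 1000, (1, 2): 1000, (2, 3): 1000, (3, 0): 1000}

# Two hypothetical mappings of processes to node coordinates.
compact = {0: (0, 0, 0), 1: (1, 0, 0), 2: (1, 1, 0), 3: (0, 1, 0)}
scattered = {0: (0, 0, 0), 1: (1, 1, 1), 2: (0, 0, 1), 3: (1, 1, 0)}

compact_cost = dilation_hop_bytes(traffic, compact)
scattered_cost = dilation_hop_bytes(traffic, scattered)
print(compact_cost, scattered_cost)
```

    In this toy case the compact mapping places ring neighbors on adjacent nodes, so every message travels one hop, while the scattered mapping forces each message across two or three hops for the same traffic.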

    A Runtime Approach for Dynamic Load Balancing of OpenMP Parallel Loops in LLVM

    Load imbalance is the major source of performance degradation in computationally-intensive applications that frequently consist of parallel loops. Efficient scheduling of parallel loops can improve the performance of such programs. OpenMP is the de-facto standard for parallel programming on shared-memory systems. The current OpenMP specification provides only three choices for loop scheduling, which are insufficient in scenarios with irregular loops, system-induced interference, or both. Therefore, this work augments the LLVM implementation of the OpenMP runtime library with eleven state-of-the-art plus three new and ready-to-use scheduling techniques. We tested the existing and the added loop scheduling strategies on several applications from the NAS, SPEC OMP 2012, and CORAL-2 benchmark suites. The experimental results show that each newly implemented scheduling technique outperforms the others in certain application and system configurations. We measured performance gains of up to 6% compared to the fastest previously available scheduling techniques. This work establishes the importance of beyond-standard scheduling options in OpenMP for the benefit of evolving applications executing on evolving multicore architectures.
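
    The three standard OpenMP loop schedules referred to in the abstract above (static, dynamic, and guided) differ mainly in how they cut the iteration space into chunks. The sketch below models the chunk-size sequences only. It is a simplified illustration, not the LLVM runtime's implementation: the function names are hypothetical, and the guided rule used here (remaining iterations divided by thread count, rounded up) is the textbook guided self-scheduling formula, whereas the exact sizes are implementation-defined in OpenMP.

```python
import math


def static_chunks(n_iters, n_threads):
    """One near-equal block per thread, modelling schedule(static) with no chunk size."""
    base, extra = divmod(n_iters, n_threads)
    return [base + (1 if t < extra else 0) for t in range(n_threads)]


def dynamic_chunks(n_iters, chunk=1):
    """Fixed-size chunks handed out on demand, modelling schedule(dynamic, chunk)."""
    full, rest = divmod(n_iters, chunk)
    return [chunk] * full + ([rest] if rest else [])


def guided_chunks(n_iters, n_threads, min_chunk=1):
    """Shrinking chunks of about remaining/n_threads iterations (assumed guided rule)."""
    chunks, remaining = [], n_iters
    while remaining > 0:
        size = min(remaining, max(min_chunk, math.ceil(remaining / n_threads)))
        chunks.append(size)
        remaining -= size
    return chunks


print(static_chunks(10, 4))
print(dynamic_chunks(10, 4))
print(guided_chunks(100, 4))
```

    Large early chunks keep scheduling overhead low, and small late chunks let threads that finish early pick up leftover work, which is why the choice of schedule matters for irregular loops.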

    Robusimpact - Design report of the specimens for all the experimental analyses - Deliverable 4.1

    The present report focuses on the design of the experimental analyses that are going to be performed within the ROBUSTIMPACT project (Grant Agreement Number: RFSR-CT-2012-00029). The project focuses on the behavior of composite steel and concrete framed buildings against accidental actions. Within the project, several experimental analyses are going to be performed, spanning from the local to the global behavior.