15 research outputs found

    Performance analysis of a hybrid MPI/OpenMP application on multi-core clusters

    The mixing of shared memory and message passing programming models within a single application has often been suggested as a method for improving scientific application performance on clusters of shared memory or multi-core systems. DL_POLY, a large-scale molecular dynamics application programmed using message passing, has been modified to add a layer of shared memory threading, and its performance has been analysed on two multi-core clusters. At lower processor counts, the extra overheads from shared memory threading in the hybrid code outweigh the performance benefits gained over the pure MPI code. At larger core counts, the hybrid model performs better than pure MPI, with reduced communication time decreasing the overall runtime.

    Hierarchical Dynamic Loop Self-Scheduling on Distributed-Memory Systems Using an MPI+MPI Approach

    Computationally intensive loops are the primary source of parallelism in scientific applications. Such loops are often irregular, and a balanced execution of their iterations is critical for achieving high performance. However, several factors may lead to an imbalanced load, such as problem characteristics and algorithmic and systemic variations. Dynamic loop self-scheduling (DLS) techniques are devised to mitigate these factors and, consequently, improve application performance. On distributed-memory systems, DLS techniques can be implemented using a hierarchical master-worker execution model and are, therefore, called hierarchical DLS techniques. These techniques self-schedule loop iterations at two levels of hardware parallelism: across and within compute nodes. Hybrid programming approaches that combine the message passing interface (MPI) with open multi-processing (OpenMP) dominate the implementation of hierarchical DLS techniques. The MPI-3 standard includes the feature of sharing memory regions among MPI processes. This feature introduced the MPI+MPI approach, which simplifies the implementation of parallel scientific applications. The present work designs and implements hierarchical DLS techniques by exploiting the MPI+MPI approach. Four well-known DLS techniques are considered in the evaluation proposed herein. The results indicate certain performance advantages of the proposed approach compared to the hybrid MPI+OpenMP approach.
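To make the two-level idea in the abstract above concrete, here is a minimal sketch of how chunk sizes could be computed under guided self-scheduling (GSS), one commonly cited DLS technique. The abstract does not name the four techniques it evaluates, so the choice of GSS is an assumption, and the function names `guided_chunks` and `hierarchical_chunks` are hypothetical. The sketch only models the chunk arithmetic; it involves no MPI or OpenMP calls and is not the paper's implementation.

```python
import math


def guided_chunks(total_iterations, workers):
    """Chunk sizes handed out by guided self-scheduling (GSS).

    Each request receives ceil(remaining / workers) iterations, so chunks
    start large and shrink as the loop nears completion.
    """
    chunks = []
    remaining = total_iterations
    while remaining > 0:
        chunk = math.ceil(remaining / workers)
        chunks.append(chunk)
        remaining -= chunk
    return chunks


def hierarchical_chunks(total_iterations, nodes, cores_per_node):
    """Two-level scheduling sketch (assumed structure, not the paper's code).

    Level 1: iterations are split into node-level chunks across `nodes`.
    Level 2: each node-level chunk is split again across `cores_per_node`.
    Returns one list of core-level chunk sizes per node-level chunk.
    """
    return [
        guided_chunks(node_chunk, cores_per_node)
        for node_chunk in guided_chunks(total_iterations, nodes)
    ]


if __name__ == "__main__":
    # Illustrative numbers only: 100 iterations, 2 nodes, 4 cores per node.
    for node_chunk in hierarchical_chunks(100, nodes=2, cores_per_node=4):
        print(sum(node_chunk), node_chunk)
```

In a real hierarchical DLS implementation, the first level would be served by a master process across compute nodes and the second level by a local coordinator within each node. Here both levels are computed up front purely to show how the chunk sizes decay.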

    Hybrid MPI+OpenMP parallelization of an FFT-based 3D Poisson solver that can reach 100000 CPU cores

    This work is devoted to the development of efficient parallel algorithms for the direct numerical simulation (DNS) of incompressible flows on modern supercomputers. A Poisson solver for problems with one uniform periodic direction is presented here. It is extended with a two-level hybrid MPI+OpenMP parallelization. Advantages and implementation details of the additional OpenMP parallelization are presented and discussed. This upgrade significantly extends the range of efficient scalability. Here, the solver has been tested on up to 12800 CPU cores for meshes with up to 10^9 nodes. However, estimations based on the presented results show that this range can potentially be stretched beyond 10^5 cores. Peer reviewed. Postprint (author's final draft).

    A design pattern for optimizations in data intensive applications using ABS and JAVA 8

    Cloud environments have become a standard method for enterprises to offer their applications by means of web services, data management systems, or simply renting out computing resources. In our previous work, we presented how a modeling language can be used together with the new features of JAVA 8 to overcome certain drawbacks of data structures and synchronization mechanisms in parallel applications. We extend this solution into a design pattern that allows application-specific optimizations in a distributed setting. We validate this integration using our previous case study of the Prime Sieve of Eratosthenes and illustrate the performance improvements in terms of speed-up and memory consumption.

    Using an Improved Data Structure in Hybrid Memory for Agent-Based Simulation

    Data structures are key to good performance in parallel and distributed applications. They must be designed with the target memory paradigm in mind, so as to exploit the architecture and obtain the best speedup. Current parallel programming languages make it easy to transform a parallel solution developed for a distributed paradigm into a hybrid solution just by adding pragma directives. At first glance, this is an attractive solution because it requires few code modifications. Nevertheless, this switch can cause a slowdown unless the code is adapted appropriately and in depth. In this paper, we present our experience migrating a data structure developed for a distributed paradigm to a hybrid paradigm. This data structure was implemented in our Fish Schooling agent-based simulator, where it can be used under either a distributed or a hybrid paradigm. The results show the importance of customizing the data structure for the infrastructure and parallel programming paradigm in use. We believe that the data structure should behave flexibly and dynamically in accordance with the paradigm used. XVIII Workshop de Procesamiento Distribuido y Paralelo (WPDP). Red de Universidades con Carreras en Informática (RedUNCI).

    A Simulation Suite for Lattice-Boltzmann based Real-Time CFD Applications Exploiting Multi-Level Parallelism on Modern Multi- and Many-Core Architectures

    We present a software approach to hardware-oriented numerics which builds upon an augmented, previously published open-source set of libraries facilitating portable code development and optimisation on a wide range of modern computer architectures. In order to maximise efficiency, we exploit all levels of parallelism, including vectorisation within CPU cores, the Cell BE and GPUs, shared memory thread-level parallelism between cores, and parallelism between heterogeneous distributed memory resources in clusters. To evaluate and validate our approach, we implement a collection of modular building blocks for the easy and fast assembly and development of CFD applications based on the shallow water equations: we combine the Lattice-Boltzmann method with fluid-structure interaction techniques in order to achieve real-time simulations targeting interactive virtual environments. Our results demonstrate that recent multi-core CPUs outperform the Cell BE, while GPUs are significantly faster than conventional multi-threaded SSE code. In addition, we verify good scalability properties of our application on small clusters.
