1,113 research outputs found

    Improving the scalability of parallel N-body applications with an event driven constraint based execution model

    Full text link
    The scalability and efficiency of graph applications are significantly constrained by conventional systems and their supporting programming models. Technology trends like multicore, manycore, and heterogeneous system architectures are introducing further challenges and possibilities for emerging application domains such as graph applications. This paper explores the space of effective parallel execution of ephemeral graphs that are dynamically generated using the Barnes-Hut algorithm to exemplify dynamic workloads. The workloads are expressed using the semantics of an Exascale computing execution model called ParalleX. For comparison, results using conventional execution model semantics are also presented. We find improved load balancing during runtime and automatic parallelism discovery improving efficiency using the advanced semantics for Exascale computing.Comment: 11 figure

    Integrating Algorithmic and Systemic Load Balancing Strategies in Parallel Scientific Applications

    Get PDF
    Load imbalance is a major source of performance degradation in parallel scientific applications. Load balancing increases the efficient use of existing resources and improves performance of parallel applications running in distributed environments. At a coarse level of granularity, advances in runtime systems for parallel programs have been proposed in order to control available resources as efficiently as possible by utilizing idle resources and using task migration. At a finer granularity level, advances in algorithmic strategies for dynamically balancing computational loads by data redistribution have been proposed in order to respond to variations in processor performance during the execution of a given parallel application. Algorithmic and systemic load balancing strategies have complementary set of advantages. An integration of these two techniques is possible and it should result in a system, which delivers advantages over each technique used in isolation. This thesis presents a design and implementation of a system that combines an algorithmic fine-grained data parallel load balancing strategy called Fractiling with a systemic coarse-grained task-parallel load balancing system called Hector. It also reports on experimental results of running N-body simulations under this integrated system. The experimental results indicate that a distributed runtime environment, which combines both algorithmic and systemic load balancing strategies, can provide performance advantages with little overhead, underscoring the importance of this approach in large complex scientific applications

    Dynamic Loop Scheduling Using MPI Passive-Target Remote Memory Access

    Get PDF
    Scientific applications often contain large computationally-intensive parallel loops. Loop scheduling techniques aim to achieve load balanced executions of such applications. For distributed-memory systems, existing dynamic loop scheduling (DLS) libraries are typically MPI-based, and employ a master-worker execution model to assign variably-sized chunks of loop iterations. The master-worker execution model may adversely impact performance due to the master-level contention. This work proposes a distributed chunk-calculation approach that does not require the master-worker execution scheme. Moreover, it considers the novel features in the latest MPI standards, such as passive-target remote memory access, shared-memory window creation, and atomic read-modify-write operations. To evaluate the proposed approach, five well-known DLS techniques, two applications, and two heterogeneous hardware setups have been considered. The DLS techniques implemented using the proposed approach outperformed their counterparts implemented using the traditional master-worker execution model

    An efficient parallel tree-code for the simulation of self-gravitating systems

    Get PDF
    We describe a parallel version of our tree-code for the simulation of self-gravitating systems in Astrophysics. It is based on a dynamic and adaptive method for the domain decomposition, which exploits the hierarchical data arrangement used by the tree-code. It shows low computational costs for the parallelization overhead -- less than 4% of the total CPU-time in the tests done -- because the domain decomposition is performed 'on the fly' during the tree setting and the portion of the tree that is local to each processor 'enriches' itself of remote data only when they are actually needed. The performances of an implementation of the parallel code on a Cray T3E are presented and discussed. They exhibit a very good behaviour of the speedup (=15 with 16 processors and 10^5 particles) and a rather low load unbalancing (< 10% using up to 16 processors), achieving a high computation speed in the forces evaluation (>10^4 particles/sec with 8 processors).Comment: 10 pages, 8 figures, LaTeX2e, A&A class file needed (included), submitted to A&A; corrected abstract word wrappin

    Extreme scale parallel NBody algorithm with event driven constraint based execution model

    Get PDF
    Traditional scientific applications such as Computational Fluid Dynamics, Partial Differential Equations based numerical methods (like Finite Difference Methods, Finite Element Methods) achieve sufficient efficiency on state of the art high performance computing systems and have been widely studied / implemented using conventional programming models. For emerging application domains such as Graph applications scalability and efficiency is significantly constrained by the conventional systems and their supporting programming models. Furthermore technology trends like multicore, manycore, heterogeneous system architectures are introducing new challenges and possibilities. Emerging technologies are requiring a rethinking of approaches to more effectively expose the underlying parallelism to the applications and the end-users. This thesis explores the space of effective parallel execution of ephemeral graphs that are dynamically generated. The standard particle based simulation, solved using the Barnes-Hut algorithm is chosen to exemplify the dynamic workloads. In this thesis the workloads are expressed using sequential execution semantics, a conventional parallel programming model - shared memory semantics and semantics of an innovative execution model designed for efficient scalable performance towards Exascale computing called ParalleX. The main outcomes of this research are parallel processing of dynamic ephemeral workloads, enabling dynamic load balancing during runtime, and using advanced semantics for exposing parallelism in scaling constrained applications

    The Parallelism Motifs of Genomic Data Analysis

    Get PDF
    Genomic data sets are growing dramatically as the cost of sequencing continues to decline and small sequencing devices become available. Enormous community databases store and share this data with the research community, but some of these genomic data analysis problems require large scale computational platforms to meet both the memory and computational requirements. These applications differ from scientific simulations that dominate the workload on high end parallel systems today and place different requirements on programming support, software libraries, and parallel architectural design. For example, they involve irregular communication patterns such as asynchronous updates to shared data structures. We consider several problems in high performance genomics analysis, including alignment, profiling, clustering, and assembly for both single genomes and metagenomes. We identify some of the common computational patterns or motifs that help inform parallelization strategies and compare our motifs to some of the established lists, arguing that at least two key patterns, sorting and hashing, are missing
    • …
    corecore