27,260 research outputs found

    Applying Graph Partitioning Methods in Measurement-based Dynamic Load Balancing

    Get PDF
    Load imbalance in an application can lead to degradation of performance and a significant drop in system utilization. Achieving the best parallel efficiency for a program requires optimal load balancing which is an NP-hard problem. This paper explores the use of graph partitioning algorithms, traditionally used for partitioning physical domains/meshes, for measurement-based dynamic load balancing of parallel applica- tions. In particular, we present repartitioning methods that consider the previous mapping to minimize dynamic migration costs. We also discuss the use of a greedy algorithm in conjunction with iterative graph partitioning algorithms to reduce the load imbalance for graphs with heavily skewed load distributions. These algorithms are implemented in a graph partitioning toolbox called SCOTCH and we use CHARM++, a migratable objects based programming model, to experiment with various load balancing scenarios. To compare with different load balancing strategies based on graph partitioners, we have implemented METIS and ZOLTAN-based load balancers in CHARM++. We demonstrate the effectiveness of the new algorithms de- veloped in SCOTCH in the context of the NAS BT solver and two micro-benchmarks. We show that SCOTCH based strategies lead to better performance compared to other existing partitioners, both in terms of the application execution time and fewer number of objects migrated.Ope

    Gunrock: GPU Graph Analytics

    Full text link
    For large-scale graph analytics on the GPU, the irregularity of data access and control flow, and the complexity of programming GPUs, have presented two significant challenges to developing a programmable high-performance graph library. "Gunrock", our graph-processing system designed specifically for the GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on operations on a vertex or edge frontier. Gunrock achieves a balance between performance and expressiveness by coupling high performance GPU computing primitives and optimization strategies with a high-level programming model that allows programmers to quickly develop new graph primitives with small code size and minimal GPU programming knowledge. We characterize the performance of various optimization strategies and evaluate Gunrock's overall performance on different GPU architectures on a wide range of graph primitives that span from traversal-based algorithms and ranking algorithms, to triangle counting and bipartite-graph-based algorithms. The results show that on a single GPU, Gunrock has on average at least an order of magnitude speedup over Boost and PowerGraph, comparable performance to the fastest GPU hardwired primitives and CPU shared-memory graph libraries such as Ligra and Galois, and better performance than any other GPU high-level graph library.Comment: 52 pages, invited paper to ACM Transactions on Parallel Computing (TOPC), an extended version of PPoPP'16 paper "Gunrock: A High-Performance Graph Processing Library on the GPU

    Gunrock: A High-Performance Graph Processing Library on the GPU

    Full text link
    For large-scale graph analytics on the GPU, the irregularity of data access and control flow, and the complexity of programming GPUs have been two significant challenges for developing a programmable high-performance graph library. "Gunrock", our graph-processing system designed specifically for the GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on operations on a vertex or edge frontier. Gunrock achieves a balance between performance and expressiveness by coupling high performance GPU computing primitives and optimization strategies with a high-level programming model that allows programmers to quickly develop new graph primitives with small code size and minimal GPU programming knowledge. We evaluate Gunrock on five key graph primitives and show that Gunrock has on average at least an order of magnitude speedup over Boost and PowerGraph, comparable performance to the fastest GPU hardwired primitives, and better performance than any other GPU high-level graph library.Comment: 14 pages, accepted by PPoPP'16 (removed the text repetition in the previous version v5

    An Efficient Execution Model for Reactive Stream Programs

    Get PDF
    Stream programming is a paradigm where a program is structured by a set of computational nodes connected by streams. Focusing on data moving between computational nodes via streams, this programming model fits well for applications that process long sequences of data. We call such applications reactive stream programs (RSPs) to distinguish them from stream programs with rather small and finite input data. In stream programming, concurrency is expressed implicitly via communication streams. This helps to reduce the complexity of parallel programming. For this reason, stream programming has gained popularity as a programming model for parallel platforms. However, it is also challenging to analyse and improve the performance without an understanding of the program's internal behaviour. This thesis targets an effi cient execution model for deploying RSPs on parallel platforms. This execution model includes a monitoring framework to understand the internal behaviour of RSPs, scheduling strategies for RSPs on uniform shared-memory platforms; and mapping techniques for deploying RSPs on heterogeneous distributed platforms. The foundation of the execution model is based on a study of the performance of RSPs in terms of throughput and latency. This study includes quantitative formulae for throughput and latency; and the identification of factors that influence these performance metrics. Based on the study of RSP performance, this thesis exploits characteristics of RSPs to derive effective scheduling strategies on uniform shared-memory platforms. Aiming to optimise both throughput and latency, these scheduling strategies are implemented in two heuristic-based schedulers. Both of them are designed to be centralised to provide load balancing for RSPs with dynamic behaviour as well as dynamic structures. The first one uses the notion of positive and negative data demands on each stream to determine the scheduling priorities. This scheduler is independent from the runtime system. The second one requires the runtime system to provide the position information for each computational node in the RSP; and uses that to decide the scheduling priorities. Our experiments show that both schedulers provides similar performance while being significantly better than a reference implementation without dynamic load balancing. Also based on the study of RSP performance, we present in this thesis two new heuristic partitioning algorithms which are used to map RSPs onto heterogeneous distributed platforms. These are Kernighan-Lin Adaptation (KLA) and Congestion Avoidance (CA), where the main objective is to optimise the throughput. This is a multi-parameter optimisation problem where existing graph partitioning algorithms are not applicable. Compared to the generic meta-heuristic Simulated Annealing algorithm, both proposed algorithms achieve equally good or better results. KLA is faster for small benchmarks while slower for large ones. In contrast, CA is always orders of magnitudes faster even for very large benchmarks

    Hierarchical Dynamic Loop Self-Scheduling on Distributed-Memory Systems Using an MPI+MPI Approach

    Full text link
    Computationally-intensive loops are the primary source of parallelism in scientific applications. Such loops are often irregular and a balanced execution of their loop iterations is critical for achieving high performance. However, several factors may lead to an imbalanced load execution, such as problem characteristics, algorithmic, and systemic variations. Dynamic loop self-scheduling (DLS) techniques are devised to mitigate these factors, and consequently, improve application performance. On distributed-memory systems, DLS techniques can be implemented using a hierarchical master-worker execution model and are, therefore, called hierarchical DLS techniques. These techniques self-schedule loop iterations at two levels of hardware parallelism: across and within compute nodes. Hybrid programming approaches that combine the message passing interface (MPI) with open multi-processing (OpenMP) dominate the implementation of hierarchical DLS techniques. The MPI-3 standard includes the feature of sharing memory regions among MPI processes. This feature introduced the MPI+MPI approach that simplifies the implementation of parallel scientific applications. The present work designs and implements hierarchical DLS techniques by exploiting the MPI+MPI approach. Four well-known DLS techniques are considered in the evaluation proposed herein. The results indicate certain performance advantages of the proposed approach compared to the hybrid MPI+OpenMP approach

    An Efficient Thread Mapping Strategy for Multiprogramming on Manycore Processors

    Full text link
    The emergence of multicore and manycore processors is set to change the parallel computing world. Applications are shifting towards increased parallelism in order to utilise these architectures efficiently. This leads to a situation where every application creates its desirable number of threads, based on its parallel nature and the system resources allowance. Task scheduling in such a multithreaded multiprogramming environment is a significant challenge. In task scheduling, not only the order of the execution, but also the mapping of threads to the execution resources is of a great importance. In this paper we state and discuss some fundamental rules based on results obtained from selected applications of the BOTS benchmarks on the 64-core TILEPro64 processor. We demonstrate how previously efficient mapping policies such as those of the SMP Linux scheduler become inefficient when the number of threads and cores grows. We propose a novel, low-overhead technique, a heuristic based on the amount of time spent by each CPU doing some useful work, to fairly distribute the workloads amongst the cores in a multiprogramming environment. Our novel approach could be implemented as a pragma similar to those in the new task-based OpenMP versions, or can be incorporated as a distributed thread mapping mechanism in future manycore programming frameworks. We show that our thread mapping scheme can outperform the native GNU/Linux thread scheduler in both single-programming and multiprogramming environments.Comment: ParCo Conference, Munich, Germany, 201

    GraphBLAST: A High-Performance Linear Algebra-based Graph Framework on the GPU

    Full text link
    High-performance implementations of graph algorithms are challenging to implement on new parallel hardware such as GPUs because of three challenges: (1) the difficulty of coming up with graph building blocks, (2) load imbalance on parallel hardware, and (3) graph problems having low arithmetic intensity. To address some of these challenges, GraphBLAS is an innovative, on-going effort by the graph analytics community to propose building blocks based on sparse linear algebra, which will allow graph algorithms to be expressed in a performant, succinct, composable and portable manner. In this paper, we examine the performance challenges of a linear-algebra-based approach to building graph frameworks and describe new design principles for overcoming these bottlenecks. Among the new design principles is exploiting input sparsity, which allows users to write graph algorithms without specifying push and pull direction. Exploiting output sparsity allows users to tell the backend which values of the output in a single vectorized computation they do not want computed. Load-balancing is an important feature for balancing work amongst parallel workers. We describe the important load-balancing features for handling graphs with different characteristics. The design principles described in this paper have been implemented in "GraphBLAST", the first high-performance linear algebra-based graph framework on NVIDIA GPUs that is open-source. The results show that on a single GPU, GraphBLAST has on average at least an order of magnitude speedup over previous GraphBLAS implementations SuiteSparse and GBTL, comparable performance to the fastest GPU hardwired primitives and shared-memory graph frameworks Ligra and Gunrock, and better performance than any other GPU graph framework, while offering a simpler and more concise programming model.Comment: 50 pages, 14 figures, 14 table

    A Test Suite for High-Performance Parallel Java

    Get PDF
    The Java programming language has a number of features that make it attractive for writing high-quality, portable parallel programs. A pure object formulation, strong typing and the exception model make programs easier to create, debug, and maintain. The elegant threading provides a simple route to parallelism on shared-memory machines. Anticipating great improvements in numerical performance, this paper presents a suite of simple programs that indicate how a pure Java Navier-Stokes solver might perform. The suite includes a parallel Euler solver. We present results from a 32-processor Hewlett-Packard machine and a 4-processor Sun server. While speedup is excellent on both machines, indicating a high-quality thread scheduler, the single-processor performance needs much improvement
    • …
    corecore