969 research outputs found

    Runtime support for virtual BSP computer

    Full text link

    Improving the scalability of parallel N-body applications with an event driven constraint based execution model

    Full text link
    The scalability and efficiency of graph applications are significantly constrained by conventional systems and their supporting programming models. Technology trends like multicore, manycore, and heterogeneous system architectures are introducing further challenges and possibilities for emerging application domains such as graph applications. This paper explores the space of effective parallel execution of ephemeral graphs that are dynamically generated using the Barnes-Hut algorithm to exemplify dynamic workloads. The workloads are expressed using the semantics of an Exascale computing execution model called ParalleX. For comparison, results using conventional execution model semantics are also presented. We find improved load balancing during runtime and automatic parallelism discovery improving efficiency using the advanced semantics for Exascale computing.Comment: 11 figure

    Gunrock: GPU Graph Analytics

    Full text link
    For large-scale graph analytics on the GPU, the irregularity of data access and control flow, and the complexity of programming GPUs, have presented two significant challenges to developing a programmable high-performance graph library. "Gunrock", our graph-processing system designed specifically for the GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on operations on a vertex or edge frontier. Gunrock achieves a balance between performance and expressiveness by coupling high performance GPU computing primitives and optimization strategies with a high-level programming model that allows programmers to quickly develop new graph primitives with small code size and minimal GPU programming knowledge. We characterize the performance of various optimization strategies and evaluate Gunrock's overall performance on different GPU architectures on a wide range of graph primitives that span from traversal-based algorithms and ranking algorithms, to triangle counting and bipartite-graph-based algorithms. The results show that on a single GPU, Gunrock has on average at least an order of magnitude speedup over Boost and PowerGraph, comparable performance to the fastest GPU hardwired primitives and CPU shared-memory graph libraries such as Ligra and Galois, and better performance than any other GPU high-level graph library.Comment: 52 pages, invited paper to ACM Transactions on Parallel Computing (TOPC), an extended version of PPoPP'16 paper "Gunrock: A High-Performance Graph Processing Library on the GPU

    An occam Style Communications System for UNIX Networks

    Get PDF
    This document describes the design of a communications system which provides occam style communications primitives under a Unix environment, using TCP/IP protocols, and any number of other protocols deemed suitable as underlying transport layers. The system will integrate with a low overhead scheduler/kernel without incurring significant costs to the execution of processes within the run time environment. A survey of relevant occam and occam3 features and related research is followed by a look at the Unix and TCP/IP facilities which determine our working constraints, and a description of the T9000 transputer's Virtual Channel Processor, which was instrumental in our formulation. Drawing from the information presented here, a design for the communications system is subsequently proposed. Finally, a preliminary investigation of methods for lightweight access control to shared resources in an environment which does not provide support for critical sections, semaphores, or busy waiting, is made. This is presented with relevance to mutual exclusion problems which arise within the proposed design. Future directions for the evolution of this project are discussed in conclusion

    A multiarchitecture parallel-processing development environment

    Get PDF
    A description is given of the hardware and software of a multiprocessor test bed - the second generation Hypercluster system. The Hypercluster architecture consists of a standard hypercube distributed-memory topology, with multiprocessor shared-memory nodes. By using standard, off-the-shelf hardware, the system can be upgraded to use rapidly improving computer technology. The Hypercluster's multiarchitecture nature makes it suitable for researching parallel algorithms in computational field simulation applications (e.g., computational fluid dynamics). The dedicated test-bed environment of the Hypercluster and its custom-built software allows experiments with various parallel-processing concepts such as message passing algorithms, debugging tools, and computational 'steering'. Such research would be difficult, if not impossible, to achieve on shared, commercial systems

    Actors vs Shared Memory: two models at work on Big Data application frameworks

    Full text link
    This work aims at analyzing how two different concurrency models, namely the shared memory model and the actor model, can influence the development of applications that manage huge masses of data, distinctive of Big Data applications. The paper compares the two models by analyzing a couple of concrete projects based on the MapReduce and Bulk Synchronous Parallel algorithmic schemes. Both projects are doubly implemented on two concrete platforms: Akka Cluster and Managed X10. The result is both a conceptual comparison of models in the Big Data Analytics scenario, and an experimental analysis based on concrete executions on a cluster platform

    Scheduling Heterogeneous HPC Applications in Next-Generation Exascale Systems

    Get PDF
    Next generation HPC applications will increasingly time-share system resources with emerging workloads such as in-situ analytics, resilience tasks, runtime adaptation services and power management activities. HPC systems must carefully schedule these co-located codes in order to reduce their impact on application performance. Among the techniques traditionally used to mitigate the performance effects of time- share systems is gang scheduling. This approach, however, leverages global synchronization and time agreement mechanisms that will become hard to support as systems increase in size. Alternative performance interference mitigation approaches must be explored for future HPC systems. This dissertation evaluates the impacts of workload concurrency in future HPC systems. It uses simulation and modeling techniques to study the performance impacts of existing and emerging interference sources on a selection of HPC benchmarks, mini-applications, and applications. It also quantifies the cost and benefits of different approaches to scheduling co-located workloads, studies performance interference mitigation solutions based on gang scheduling, and examines their synchronization requirements. To do so, this dissertation presents and leverages a new Extreme Value Theory- based model to characterize interference sources, and investigate their impact on Bulk Synchronous Parallel (BSP) applications. It demonstrates how this model can be used to analyze the interference attenuation effects of alternative fine-grained OS scheduling approaches based on periodic real time schedulers. This analysis can, in turn, guide the design of those mitigation techniques by providing tools to understand the tradeoffs of selecting scheduling parameters
    • …
    corecore