
    STAPL-RTS: A Runtime System for Massive Parallelism

    Modern High Performance Computing (HPC) systems are complex, with deep memory hierarchies and increasing use of computational heterogeneity via accelerators. When developing applications for these platforms, programmers are faced with two bad choices. On one hand, they can explicitly manage machine resources, writing programs using low level primitives from multiple APIs (e.g., MPI+OpenMP), creating efficient but rigid, difficult to extend, and non-portable implementations. Alternatively, users can adopt higher level programming environments, often at the cost of lost performance. Our approach is to maintain the high level nature of the application without sacrificing performance, by relying on the transfer of high level, application semantic knowledge between layers of the software stack at an appropriate level of abstraction and performing optimizations on a per-layer basis. In this dissertation, we present the STAPL Runtime System (STAPL-RTS), a runtime system built for portable performance, suitable for massively parallel machines. While the STAPL-RTS abstracts and virtualizes the underlying platform for portability, it uses information from the upper layers to perform the appropriate low level optimizations that restore the performance characteristics. We outline the fundamental ideas behind the design of the STAPL-RTS, such as the always distributed communication model and its asynchronous operations. Through appropriate code examples and benchmarks, we show that high level information allows applications written on top of the STAPL-RTS to attain the performance of optimized but ad hoc solutions. Using the STAPL library, we demonstrate how this information guides important decisions in the STAPL-RTS, such as multi-protocol communication coordination and request aggregation using established C++ programming idioms. Recognizing that nested parallelism is of increasing interest for both expressivity and performance, we present a parallel model that combines asynchronous, one-sided operations with isolated nested parallel sections. Previous approaches to nested parallelism targeted either static applications through the use of blocking, isolated sections, or dynamic applications by using asynchronous mechanisms (i.e., recursive task spawning), which come at the expense of isolation. We combine the flexibility of dynamic task creation with the isolation guarantees of the static models by allowing the creation of asynchronous, one-sided nested parallel sections that work in tandem with the more traditional, synchronous, collective nested parallelism. This allows selective, run-time customizable use of parallelism in an application, based on the input and the algorithm.
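
    The asynchronous, one-sided operations and request aggregation described above can be suggested with a small sketch. The C++ illustration below buffers one-sided requests and sends them in batches; it is a minimal, single-process sketch, and the names aggregating_channel, async_rmi, and flush are invented here rather than taken from the actual STAPL-RTS interface.

        // Hypothetical illustration of asynchronous, one-sided requests with
        // aggregation; none of these names come from the real STAPL-RTS.
        #include <cstddef>
        #include <functional>
        #include <iostream>
        #include <vector>

        class aggregating_channel {
            std::vector<std::function<void()>> pending_;  // buffered remote requests
            std::size_t threshold_;                       // batch size before sending
        public:
            explicit aggregating_channel(std::size_t threshold)
                : threshold_(threshold) {}

            // One-sided request: enqueue and return immediately; buffered
            // requests are combined into a single "message" once the
            // aggregation threshold is reached.
            void async_rmi(std::function<void()> request) {
                pending_.push_back(std::move(request));
                if (pending_.size() >= threshold_) flush();
            }

            // Send the aggregated batch (executed locally in this demo).
            void flush() {
                for (auto& r : pending_) r();
                pending_.clear();
            }
        };

        int main() {
            aggregating_channel chan(4);  // aggregate four requests per batch
            long counter = 0;
            for (int i = 0; i < 10; ++i)
                chan.async_rmi([&counter, i] { counter += i; });
            chan.flush();                 // drain the leftover requests
            std::cout << "counter = " << counter << '\n';  // prints 45
        }

    Batching amortizes per-message overhead, which is the same motivation the abstract gives for aggregation in the STAPL-RTS.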

    The Paragraph: Design and Implementation of the STAPL Parallel Task Graph

    Parallel programming is becoming mainstream due to the increased availability of multiprocessor and multicore architectures and the need to solve larger and more complex problems. Languages and tools available for the development of parallel applications are often difficult to learn and use. The Standard Template Adaptive Parallel Library (STAPL) is being developed to help programmers address these difficulties. STAPL is a parallel C++ library with functionality similar to STL, the ISO-adopted C++ Standard Template Library. STAPL provides a collection of parallel pContainers for data storage and pViews that provide uniform data access operations by abstracting away the details of the pContainer data distribution. Generic pAlgorithms are written in terms of PARAGRAPHs, high level task graphs expressed as a composition of common parallel patterns. These task graphs define a set of operations on pViews as well as any ordering (i.e., dependences) on these operations that must be enforced by STAPL for a valid execution. The subject of this dissertation is the PARAGRAPH Executor, a framework that manages the runtime instantiation and execution of STAPL PARAGRAPHs. We address several challenges present when using a task graph program representation and discuss a novel approach to dependence specification which allows task graph creation and execution to proceed concurrently. This overlapping increases scalability and reduces the resources required by the PARAGRAPH Executor. We also describe the interface for task specification as well as optimizations that address issues such as data locality. We evaluate the performance of the PARAGRAPH Executor on several parallel machines, including massively parallel Cray XT4 and Cray XE6 systems and an IBM Power5 cluster. Using tests including generic parallel algorithms, kernels from the NAS NPB suite, and a nuclear particle transport application written in STAPL, we demonstrate that the PARAGRAPH Executor enables STAPL to exhibit good scalability on more than 10^4 processors.
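
    The dependence-driven execution described above can be illustrated with a toy task graph that runs a task once all of its predecessors have completed. This is a sketch of the general technique only; task_graph, add_task, and add_dependence are invented names, not STAPL's PARAGRAPH interface, and a real executor would run ready tasks in parallel rather than in a sequential loop.

        // Toy dependence-driven task graph: a task becomes runnable when its
        // count of unfinished predecessors reaches zero.
        #include <functional>
        #include <iostream>
        #include <queue>
        #include <vector>

        class task_graph {
            struct task {
                std::function<void()> work;
                int unsatisfied;              // predecessors not yet finished
                std::vector<int> successors;  // tasks to notify on completion
            };
            std::vector<task> tasks_;
        public:
            // Declare a task and the number of predecessors it waits on.
            int add_task(std::function<void()> work, int num_preds) {
                tasks_.push_back({std::move(work), num_preds, {}});
                return static_cast<int>(tasks_.size()) - 1;
            }
            void add_dependence(int pred, int succ) {
                tasks_[pred].successors.push_back(succ);
            }
            // Run tasks as their dependences become satisfied.
            void execute() {
                std::queue<int> ready;
                for (int i = 0; i < static_cast<int>(tasks_.size()); ++i)
                    if (tasks_[i].unsatisfied == 0) ready.push(i);
                while (!ready.empty()) {
                    int t = ready.front(); ready.pop();
                    tasks_[t].work();
                    for (int s : tasks_[t].successors)
                        if (--tasks_[s].unsatisfied == 0) ready.push(s);
                }
            }
        };

        int main() {
            task_graph g;
            int a = g.add_task([] { std::cout << "read input\n"; }, 0);
            int b = g.add_task([] { std::cout << "compute\n"; }, 1);
            int c = g.add_task([] { std::cout << "write output\n"; }, 1);
            g.add_dependence(a, b);   // b waits on a
            g.add_dependence(b, c);   // c waits on b
            g.execute();              // runs a, then b, then c
        }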

    Parmi: a Publish/Subscribe Based Asynchronous Rmi Framework

    This thesis designs a publish/subscribe-based asynchronous RMI (Remote Method Invocation) framework (PARMI) for objects residing on different machines over a network. The objectives of this thesis are to: (1) explore the existing RMI model and analyze the performance of an existing RMI implementation; (2) study related programming models for designing an asynchronous RMI structure; (3) design a new PARMI framework based on the publish/subscribe paradigm, realizing asynchronous communication and computation and decoupling objects in space and time; and (4) evaluate the performance of the PARMI framework in local/remote and homogeneous/heterogeneous environments. An example scientific application based on the Jacobi iteration numerical method is developed. Extensive experimental evaluation on up to 64 processors demonstrates the performance improvement achieved using the PARMI framework.
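
    The space and time decoupling that the publish/subscribe paradigm gives PARMI can be sketched with a minimal broker: workers publish results to a topic instead of returning them through a blocking call, and subscribers receive them whenever they arrive. The example below is written in C++ for consistency with the rest of this listing; broker, subscribe, and publish are invented names, not PARMI's API.

        // Minimal publish/subscribe broker: publishers and subscribers never
        // reference each other directly, decoupling them in space and time.
        #include <functional>
        #include <iostream>
        #include <map>
        #include <string>
        #include <vector>

        class broker {
            std::map<std::string, std::vector<std::function<void(double)>>> subs_;
        public:
            void subscribe(const std::string& topic,
                           std::function<void(double)> callback) {
                subs_[topic].push_back(std::move(callback));
            }
            void publish(const std::string& topic, double value) {
                for (auto& cb : subs_[topic]) cb(value);
            }
        };

        int main() {
            broker b;
            // A consumer registers interest in the residual of each sweep.
            b.subscribe("jacobi/residual", [](double r) {
                std::cout << "residual = " << r << '\n';
            });
            // A worker publishes each result asynchronously instead of
            // returning it through a blocking RMI call.
            for (double r : {1.0, 0.1, 0.01})
                b.publish("jacobi/residual", r);
        }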

    Techniques for Implementing Concurrent Exceptions in C++

    In recent years, concurrent programming has become increasingly important. Multi-core processors and distributed programming allow the use of real parallelism for increased computing power. Graphical user interfaces in modern applications benefit from concurrency, which allows them to stay responsive in all situations. Concurrency support has been added to many programming languages, libraries and frameworks. While exceptions are widely used in sequential programming, many concurrent programming languages and libraries provide little or no support for concurrent exception handling. This is also true for the C++ programming language, which is widely used in the industry for system programming, mobile and embedded applications, as well as high-performance computing, server and traditional desktop applications. The 2003 version of the C++ standard provides no support for concurrency, and the new C++11 standard only supports thread-based concurrency in a shared address space. Procedure and method calls across address space boundaries require support for serialisation. C++ libraries exist for serialisation of parameters and return values, but serialisation of exceptions is more complicated: the types of passed exceptions are not known at compile time, and the exceptions may be thrown by third-party code. Concurrency also complicates exception handling itself, making it possible for several exceptions to be thrown concurrently and end up in the same process. This scenario is not supported in most current programming languages, including C++. This thesis analyses problems in concurrent exception handling and presents mechanisms for solving them. The solution includes automatic serialisation of C++ exceptions for RPC, and exception reduction, future groups and compound exceptions for concurrent exception handling. The usability and performance of the mechanisms are measured and discussed using a use case application. Mechanisms for concurrent exception handling are provided using a library approach (i.e., without extending the language itself). Template metaprogramming is used in the solutions to automate the mechanisms as much as possible. The solutions to the problems given in this thesis can be used in other programming languages as well.
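
    The future group and compound exception mechanisms mentioned above can be approximated with standard C++11 facilities: wait on a group of futures and gather every exception thrown, instead of losing all but the first. The sketch below illustrates the idea only; compound_exception and wait_all are invented names, not the thesis's library.

        // Collect every exception thrown by a group of concurrent tasks into
        // a single aggregate exception.
        #include <exception>
        #include <future>
        #include <iostream>
        #include <stdexcept>
        #include <vector>

        struct compound_exception : std::exception {
            std::vector<std::exception_ptr> causes;
            const char* what() const noexcept override {
                return "multiple concurrent exceptions";
            }
        };

        // Wait for all futures; if any threw, rethrow the failures together.
        void wait_all(std::vector<std::future<void>>& futures) {
            compound_exception comp;
            for (auto& f : futures) {
                try { f.get(); }
                catch (...) { comp.causes.push_back(std::current_exception()); }
            }
            if (!comp.causes.empty()) throw comp;
        }

        int main() {
            std::vector<std::future<void>> fs;
            fs.push_back(std::async(std::launch::async, [] { /* succeeds */ }));
            fs.push_back(std::async(std::launch::async, [] {
                throw std::runtime_error("worker 1 failed");
            }));
            fs.push_back(std::async(std::launch::async, [] {
                throw std::runtime_error("worker 2 failed");
            }));
            try { wait_all(fs); }
            catch (const compound_exception& e) {
                std::cout << e.what() << ": " << e.causes.size() << " cause(s)\n";
                for (const auto& c : e.causes) {
                    try { std::rethrow_exception(c); }
                    catch (const std::exception& ex) {
                        std::cout << "  " << ex.what() << '\n';
                    }
                }
            }
        }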

    Function Shipping in a Scalable Parallel Programming Model

    Increasingly, a large number of scientific and technical applications exhibit dynamically generated parallelism or irregular data access patterns. These applications pose significant challenges to achieving scalable performance on large scale parallel systems. This thesis explores the advantages of using function shipping as a language level primitive to help simplify writing scalable irregular and dynamic parallel applications. Function shipping provides a mechanism to avoid exposing latency by enabling users to ship data and computation together to a remote worker for execution. In the context of the Coarray Fortran 2.0 Partitioned Global Address Space language, we implement function shipping and the finish synchronization construct, which ensures global completion of a set of shipped function instances. We demonstrate the usability and performance benefits of function shipping with several benchmarks. Experiments on emerging supercomputers show that function shipping is useful and effective in achieving scalable performance with dynamic and irregular algorithms.
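
    Although the thesis implements function shipping in Coarray Fortran 2.0, the core pattern of shipping work asynchronously and waiting on a finish construct for global completion can be sketched in C++, with threads standing in for remote images. finish_scope, ship, and finish below are invented names for the language-level primitives the abstract describes.

        // Ship work asynchronously; finish() blocks until every shipped
        // function instance has completed.
        #include <atomic>
        #include <functional>
        #include <future>
        #include <iostream>
        #include <vector>

        class finish_scope {
            std::vector<std::future<void>> inflight_;
        public:
            // Ship data and computation together to a "remote" worker and
            // return immediately, rather than blocking on a round trip.
            void ship(std::function<void()> work) {
                inflight_.push_back(
                    std::async(std::launch::async, std::move(work)));
            }
            // Global completion of all shipped function instances.
            void finish() {
                for (auto& f : inflight_) f.get();
                inflight_.clear();
            }
        };

        int main() {
            finish_scope scope;
            std::atomic<long> sum{0};
            for (long i = 1; i <= 100; ++i)
                scope.ship([&sum, i] { sum += i; });  // dynamically generated work
            scope.finish();                           // wait for all instances
            std::cout << "sum = " << sum << '\n';     // prints 5050
        }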

    A collaborative environment for distributed Web-based CAD

    Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, June 1999. Includes bibliographical references (p. 67-69). By Gangadhar Konduri.

    The STAPL Parallel Container Framework

    The Standard Template Adaptive Parallel Library (STAPL) is a parallel programming infrastructure that extends C++ with support for parallelism. STAPL provides a run-time system, a collection of distributed data structures (pContainers) and parallel algorithms (pAlgorithms), and a generic methodology for extending them to provide customized functionality. Parallel containers are data structures addressing issues related to data partitioning, distribution, communication, synchronization, load balancing, and thread safety. This dissertation presents the STAPL Parallel Container Framework (PCF), which is designed to facilitate the development of generic parallel containers. We introduce a set of concepts and a methodology for assembling a pContainer from existing sequential or parallel containers without requiring the programmer to deal with concurrency or data distribution issues. The STAPL PCF provides a large number of basic parallel data structures (e.g., pArray, pList, pVector, pMatrix, pGraph, pMap, pSet). The STAPL PCF is distinguished from existing work by offering a class hierarchy and a composition mechanism that allows users to extend and customize the current container base for improved application expressivity and performance. We evaluate the performance of the STAPL pContainers on various parallel machines, including a massively parallel Cray XT4 system and an IBM P5-575 cluster. We show that the pContainer methods, generic pAlgorithms, and different applications all provide good scalability on more than 10^4 processors.
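
    The PCF idea of assembling a parallel container from existing sequential containers, while hiding the distribution from the user, can be suggested with a toy block-partitioned vector. partitioned_vector and its fixed block layout are invented for illustration and are far simpler than an actual pContainer, which also handles communication, load balancing, and thread safety.

        // Toy "pContainer": sequential std::vectors per partition behind a
        // uniform global-index interface that hides the distribution.
        #include <algorithm>
        #include <cstddef>
        #include <iostream>
        #include <vector>

        template <typename T>
        class partitioned_vector {
            std::vector<std::vector<T>> parts_;  // one sequential container per partition
            std::size_t block_;                  // block size of the distribution
        public:
            partitioned_vector(std::size_t n, std::size_t num_parts)
                : parts_(num_parts), block_((n + num_parts - 1) / num_parts) {
                for (std::size_t p = 0; p < num_parts; ++p) {
                    std::size_t lo = p * block_;
                    std::size_t hi = std::min(n, lo + block_);
                    parts_[p].resize(hi > lo ? hi - lo : 0);
                }
            }
            // Uniform access: a global index is mapped to (partition, offset),
            // much as a pView hides the pContainer's distribution.
            T& operator[](std::size_t i) { return parts_[i / block_][i % block_]; }
        };

        int main() {
            partitioned_vector<int> v(10, 3);  // 10 elements over 3 partitions
            for (std::size_t i = 0; i < 10; ++i) v[i] = static_cast<int>(i * i);
            std::cout << "v[7] = " << v[7] << '\n';  // prints 49
        }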