
    Group Communication Patterns for High Performance Computing in Scala

    We developed a Functional object-oriented Parallel framework (FooPar) for high-level high-performance computing in Scala. Central to this framework are Distributed Memory Parallel Data structures (DPDs), i.e., collections of data distributed in a shared-nothing system together with parallel operations on these data. In this paper, we first present FooPar's architecture and the idea of DPDs and group communications. Then, we show how DPDs can be implemented elegantly and efficiently in Scala based on the Traversable/Builder pattern, unifying Functional and Object-Oriented Programming. We prove the correctness and safety of one communication algorithm and show how specification testing (via ScalaCheck) can be used to bridge the gap between proof and implementation. Furthermore, we show that the group communication operations of FooPar outperform those of the MPJ Express open-source MPI bindings for Java, both asymptotically and empirically. FooPar has already been shown to be capable of achieving close-to-optimal performance for dense matrix-matrix multiplication via JNI. In this article, we present results on a parallel implementation of the Floyd-Warshall algorithm in FooPar, achieving more than 94% efficiency compared to the serial version on a cluster using 100 cores for matrices of dimension 38000 x 38000.
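    A rough illustration of why the Floyd-Warshall iteration parallelises well: within each k-step, the row updates are mutually independent, so they can be distributed (as FooPar does with its DPDs) or, as in this plain-Scala sketch with futures over a shared-memory matrix, fanned out per row. FooPar's actual DPD API is not reproduced here.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object ParallelFloydWarshall {
  /** All-pairs shortest paths on an adjacency matrix, mutated in place.
    * Within one k-iteration the row updates are mutually independent
    * (row k is unchanged because dist(k)(k) == 0), so each row can be
    * updated in parallel; a barrier separates successive k-iterations. */
  def run(dist: Array[Array[Double]]): Unit = {
    val n = dist.length
    for (k <- 0 until n) {
      val rows = (0 until n).map { i =>
        Future {
          val dik = dist(i)(k)
          var j = 0
          while (j < n) {
            val viaK = dik + dist(k)(j)
            if (viaK < dist(i)(j)) dist(i)(j) = viaK
            j += 1
          }
        }
      }
      Await.result(Future.sequence(rows), Duration.Inf) // per-k barrier
    }
  }
}
```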

    Higher levels of process synchronisation

    Four new synchronisation primitives (SEMAPHOREs, RESOURCEs, EVENTs and BUCKETs) were introduced in the KRoC 0.8beta release of occam for SPARC (SunOS/Solaris) and Alpha (OSF/1) UNIX workstations [1][2][3]. This paper reports on the rationale, application and implementation of two of these (SEMAPHOREs and EVENTs). Details on the other two may be found on the web [4]. The new primitives are designed to support higher-level mechanisms of SHARING between parallel processes and give us greater powers of expression. They will also let greater levels of concurrency be safely exploited from future parallel architectures, such as those providing (virtual) shared memory. They demonstrate that occam is neutral in any debate between the merits of message-passing versus shared-memory parallelism, enabling applications to take advantage of whichever paradigm (or mixture of paradigms) is the most appropriate. The new primitives could be (but are not) implemented in terms of traditional channels, but only at the expense of increased complexity and computational overhead. The primitives are immediately useful even for uni-processors - for example, the cost of a fair ALT can be reduced from O(n) to O(1). In fact, all the operations associated with the new primitives have constant space and time complexities, and the constants are very low. The KRoC release provides an Abstract Data Type interface to the primitives. However, direct use of such mechanisms still allows the user to misuse them. They must be used in the ways prescribed (in this paper and in [4]), or else their semantics become unpredictable. No tool is provided to check correct usage at this level. The intention is to bind those primitives found to be useful into higher-level versions of occam. Some of the primitives (e.g. SEMAPHOREs) may never themselves be made visible in the language, but may be used to implement bindings of higher-level paradigms (such as SHARED channels and BLACKBOARDs). The compiler will perform the relevant usage checking on all new language bindings, closing the security loopholes opened by raw use of the primitives. The paper closes by relating this work to the notions of virtual transputers, microcoded schedulers, object orientation and Java threads.
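    The paper's SEMAPHORE primitive is occam-specific, but its use for a SHARED channel translates directly to other runtimes: writers claim a fair (FIFO) semaphore, so each claim and release costs O(1) instead of an O(n) fair ALT across n client channels. Below is a minimal stand-in sketched in Scala on the JVM; occam's real primitives and prescribed usage rules are not reproduced.

```scala
import java.util.concurrent.{Semaphore, SynchronousQueue}

/** A SHARED channel in the spirit of the paper's SEMAPHORE-based
  * binding: many writers queue FIFO on a fair semaphore, so each
  * claim/release is O(1) rather than an O(n) fair ALT over n channels. */
final class SharedChan[A] {
  private val writerLock = new Semaphore(1, /*fair=*/ true)
  private val chan = new SynchronousQueue[A]() // rendezvous, like an occam channel

  def write(a: A): Unit = {
    writerLock.acquire()     // join the FIFO queue of writers
    try chan.put(a)          // synchronised communication with the reader
    finally writerLock.release()
  }

  def read(): A = chan.take()
}
```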

    A User-Level Process Package for Concurrent Computing

    A lightweight user-level process (ULP) package for parallel computing is described. Each ULP has its own register context, stack, and data and heap space; communication with other ULPs is performed using locally synchronous, location-transparent message-passing primitives. The aim of the package is to provide support for lightweight over-decomposition, optimized local communication and transparent dynamic migration. The package supports a subset of the Parallel Virtual Machine (PVM) interface [Sun90].
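    A toy model of the two properties the abstract names, sketched in Scala under stated assumptions: mailbox ids give location transparency, and a rendezvous queue gives the locally synchronous send. Real ULPs carry their own register context and stack; plain JVM threads stand in for them here.

```scala
import java.util.concurrent.SynchronousQueue
import scala.collection.concurrent.TrieMap

/** Toy ULP model: processes are addressed by an id (location
  * transparency) and communicate through rendezvous queues (a
  * locally synchronous send blocks until the matching receive). */
object Ulp {
  private val mailboxes = TrieMap.empty[Int, SynchronousQueue[Any]]
  private def box(id: Int) =
    mailboxes.getOrElseUpdate(id, new SynchronousQueue[Any]())

  def spawn(body: => Unit): Thread = {
    val t = new Thread(() => body)
    t.start()
    t
  }
  def send(to: Int, msg: Any): Unit = box(to).put(msg) // blocks until taken
  def receive(me: Int): Any         = box(me).take()
}
```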

    High-speed detection of emergent market clustering via an unsupervised parallel genetic algorithm

    We implement a master-slave parallel genetic algorithm (PGA) with a bespoke log-likelihood fitness function to identify emergent clusters within price evolutions. We use graphics processing units (GPUs) to implement a PGA and visualise the results using disjoint minimal spanning trees (MSTs). We demonstrate that our GPU PGA, implemented on a commercially available general-purpose GPU, is able to recover stock clusters in sub-second speed, based on a subset of stocks in the South African market. This represents a pragmatic choice for low-cost, scalable parallel computing and is significantly faster than a prototype serial implementation in an optimised C-based fourth-generation programming language, although the results are not directly comparable due to compiler differences. Combined with fast online intraday correlation matrix estimation from high-frequency data for cluster identification, the proposed implementation offers cost-effective, near-real-time risk assessment for financial practitioners. Comment: 10 pages, 5 figures, 4 tables. More thorough discussion of implementation.
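    The master-slave structure is the part that maps to a GPU: the master performs selection and variation, while fitness evaluation, the expensive step, fans out in parallel. A Scala skeleton of one generation follows; the fitness function is a hypothetical placeholder, not the paper's log-likelihood, and the cluster count of 8 is assumed for illustration.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import scala.util.Random

/** Master-slave GA skeleton: fitness evaluation fans out in parallel
  * (to futures here, to GPU threads in the paper). */
object MasterSlaveGA {
  type Genome = Array[Int] // stock index -> cluster label

  def fitness(g: Genome): Double = -g.distinct.length.toDouble // placeholder

  def step(pop: Vector[Genome], rnd: Random): Vector[Genome] = {
    val scored = Await.result(
      Future.traverse(pop)(g => Future((g, fitness(g)))), Duration.Inf)
    val elite = scored.sortBy(-_._2).map(_._1).take(pop.size / 2)
    elite ++ elite.map { g => // refill the population by mutating elites
      val child = g.clone()
      child(rnd.nextInt(child.length)) = rnd.nextInt(8) // 8 clusters assumed
      child
    }
  }
}
```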

    The Paragraph: Design and Implementation of the STAPL Parallel Task Graph

    Parallel programming is becoming mainstream due to the increased availability of multiprocessor and multicore architectures and the need to solve larger and more complex problems. Languages and tools available for the development of parallel applications are often difficult to learn and use. The Standard Template Adaptive Parallel Library (STAPL) is being developed to help programmers address these difficulties. STAPL is a parallel C++ library with functionality similar to STL, the ISO-adopted C++ Standard Template Library. STAPL provides a collection of parallel pContainers for data storage and pViews that provide uniform data access operations by abstracting away the details of the pContainer data distribution. Generic pAlgorithms are written in terms of PARAGRAPHs, high-level task graphs expressed as a composition of common parallel patterns. These task graphs define a set of operations on pViews as well as any ordering (i.e., dependences) on these operations that must be enforced by STAPL for a valid execution. The subject of this dissertation is the PARAGRAPH Executor, a framework that manages the runtime instantiation and execution of STAPL PARAGRAPHs. We address several challenges present when using a task graph program representation and discuss a novel approach to dependence specification which allows task graph creation and execution to proceed concurrently. This overlapping increases scalability and reduces the resources required by the PARAGRAPH Executor. We also describe the interface for task specification as well as optimizations that address issues such as data locality. We evaluate the performance of the PARAGRAPH Executor on several parallel machines including massively parallel Cray XT4 and Cray XE6 systems and an IBM Power5 cluster. Using tests including generic parallel algorithms, kernels from the NAS NPB suite, and a nuclear particle transport application written in STAPL, we demonstrate that the PARAGRAPH Executor enables STAPL to exhibit good scalability on more than 10^4 processors.
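    The key mechanism behind overlapping creation and execution can be shown in a few lines: each task carries a count of unfinished predecessors and becomes runnable when that count reaches zero, so the graph never has to be complete before execution starts. The Scala sketch below uses illustrative names, not STAPL's API, and drains tasks single-threaded where a real executor would use worker threads.

```scala
import java.util.concurrent.atomic.AtomicInteger
import java.util.concurrent.ConcurrentLinkedQueue

/** Minimal dependence-counting task graph: a task stores its number of
  * unfinished predecessors and is released to the ready queue when the
  * count reaches zero, so tasks may be added while earlier ones run. */
final class Task(val body: () => Unit, npreds: Int) {
  val remaining = new AtomicInteger(npreds)
  var succs: List[Task] = Nil
}

final class TaskGraph {
  private val ready = new ConcurrentLinkedQueue[Task]()

  /** Register a task after its predecessors; preds must not have run yet. */
  def add(body: () => Unit, preds: Task*): Task = {
    val t = new Task(body, preds.length)
    preds.foreach(p => p.succs ::= t)
    if (preds.isEmpty) ready.add(t)
    t
  }

  def runAll(): Unit = {
    var t = ready.poll()
    while (t != null) {
      t.body()
      t.succs.foreach(s => if (s.remaining.decrementAndGet() == 0) ready.add(s))
      t = ready.poll()
    }
  }
}
```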

    pocl: A Performance-Portable OpenCL Implementation

    OpenCL is a standard for parallel programming of heterogeneous systems. The benefits of a common programming standard are clear; multiple vendors can provide support for application descriptions written according to the standard, thus reducing the program porting effort. While the standard brings the obvious benefits of platform portability, the performance portability aspects are largely left to the programmer. The situation is made worse by multiple proprietary vendor implementations with different characteristics and, thus, different required optimization strategies. In this paper, we propose an OpenCL implementation that is both portable and performance portable. At its core is a kernel compiler that can be used to exploit the data parallelism of OpenCL programs on multiple platforms with different parallel hardware styles. The kernel compiler is modularized to perform target-independent parallel region formation separately from the target-specific parallel mapping of the regions, to enable support for various styles of fine-grained parallel resources such as subword SIMD extensions, SIMD datapaths and static multi-issue. Unlike previous similar techniques that work on the source level, the parallel region formation retains the information of the data parallelism using the LLVM IR and its metadata infrastructure. This data can be exploited by the later generic compiler passes for efficient parallelization. The proposed open-source implementation of OpenCL is also platform portable, enabling OpenCL on a wide range of architectures, both already commercialized and those still under research. The paper describes how the portability of the implementation is achieved. Our results show that most of the benchmarked applications, when compiled using pocl, were faster or close to as fast as the best proprietary OpenCL implementation for the platform at hand. Comment: This article was published in 2015; it is now openly accessible via arXiv.
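    The kernel compiler's central idea can be shown without LLVM: a kernel is written per work-item, and a target-independent pass wraps it in "work-item loops" over the local range; target-specific passes then map those loops to SIMD lanes, static multi-issue slots, or plain scalar loops. A Scala sketch of the transformed work-group function follows; pocl's real pipeline operates on LLVM IR, not source, so this is conceptual only.

```scala
object WorkItemLoops {
  type Kernel = (Int, Int) => Unit // (globalId, localId) => effect

  /** One work-group executed as the serialized loop the compiler forms. */
  def runWorkGroup(kernel: Kernel, groupId: Int, localSize: Int): Unit = {
    var lid = 0
    while (lid < localSize) {                // the "work-item loop":
      kernel(groupId * localSize + lid, lid) // one vectorizable region
      lid += 1
    }
  }

  def main(args: Array[String]): Unit = {
    val out = new Array[Float](16)
    val scale: Kernel = (gid, _) => out(gid) = gid * 2.0f
    (0 until 4).foreach(g => runWorkGroup(scale, g, localSize = 4))
    println(out.mkString(", "))
  }
}
```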

    SAD progress report


    Tackling Exascale Software Challenges in Molecular Dynamics Simulations with GROMACS

    GROMACS is a widely used package for biomolecular simulation, and over the last two decades it has evolved from small-scale efficiency to advanced heterogeneous acceleration and multi-level parallelism targeting some of the largest supercomputers in the world. Here, we describe some of the ways we have been able to realize this through the use of parallelization on all levels, combined with a constant focus on absolute performance. Release 4.6 of GROMACS uses SIMD acceleration on a wide range of architectures, GPU offloading acceleration, and both OpenMP and MPI parallelism within and between nodes, respectively. The recent work on acceleration made it necessary to revisit the fundamental algorithms of molecular simulation, including the concept of neighbor searching, and we discuss the present and future challenges we see for exascale simulation, in particular a very fine-grained task parallelism. We also discuss the software management, code peer review and continuous integration testing required for a project of this complexity. Comment: EASC 2014 conference proceedings.
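    For context on the neighbor-searching revisit: the classic scheme bins particles into cells at least as wide as the cutoff, so only the 27 surrounding cells need scanning per particle. GROMACS 4.6 replaces this with a SIMD/GPU-friendly cluster pair list; the generic cell list below (plain Scala, nonnegative coordinates assumed) only illustrates the baseline concept it builds on.

```scala
/** Generic cell-list neighbor search: bin particles into cells of side
  * >= cutoff, then test only the 27 surrounding cells for each particle. */
object CellList {
  final case class P(x: Double, y: Double, z: Double)

  def neighbors(ps: Array[P], cutoff: Double): Seq[(Int, Int)] = {
    def cell(p: P) =
      ((p.x / cutoff).toInt, (p.y / cutoff).toInt, (p.z / cutoff).toInt)
    def dist2(a: P, b: P): Double = {
      val dx = a.x - b.x; val dy = a.y - b.y; val dz = a.z - b.z
      dx * dx + dy * dy + dz * dz
    }
    val bins = ps.indices.groupBy(i => cell(ps(i)))
    for {
      i <- ps.indices
      (cx, cy, cz) = cell(ps(i))
      dx <- -1 to 1; dy <- -1 to 1; dz <- -1 to 1
      j <- bins.getOrElse((cx + dx, cy + dy, cz + dz), Nil)
      if j > i && dist2(ps(i), ps(j)) <= cutoff * cutoff
    } yield (i, j)
  }
}
```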