6,707 research outputs found

    Accelerating sequential programs using FastFlow and self-offloading

    Full text link
    FastFlow is a programming environment specifically targeting cache-coherent shared-memory multi-cores. FastFlow is implemented as a stack of C++ template libraries built on top of lock-free (fence-free) synchronization mechanisms. In this paper we present a further evolution of FastFlow enabling programmers to offload part of their workload on a dynamically created software accelerator running on unused CPUs. The offloaded function can be easily derived from pre-existing sequential code. We emphasize in particular the effective trade-off between human productivity and execution efficiency of the approach.Comment: 17 pages + cove

    DART-MPI: An MPI-based Implementation of a PGAS Runtime System

    Full text link
    A Partitioned Global Address Space (PGAS) approach treats a distributed system as if the memory were shared on a global level. Given such a global view on memory, the user may program applications very much like shared memory systems. This greatly simplifies the tasks of developing parallel applications, because no explicit communication has to be specified in the program for data exchange between different computing nodes. In this paper we present DART, a runtime environment, which implements the PGAS paradigm on large-scale high-performance computing clusters. A specific feature of our implementation is the use of one-sided communication of the Message Passing Interface (MPI) version 3 (i.e. MPI-3) as the underlying communication substrate. We evaluated the performance of the implementation with several low-level kernels in order to determine overheads and limitations in comparison to the underlying MPI-3.Comment: 11 pages, International Conference on Partitioned Global Address Space Programming Models (PGAS14

    Process Algebras

    Get PDF
    Process Algebras are mathematically rigorous languages with well defined semantics that permit describing and verifying properties of concurrent communicating systems. They can be seen as models of processes, regarded as agents that act and interact continuously with other similar agents and with their common environment. The agents may be real-world objects (even people), or they may be artifacts, embodied perhaps in computer hardware or software systems. Many different approaches (operational, denotational, algebraic) are taken for describing the meaning of processes. However, the operational approach is the reference one. By relying on the so called Structural Operational Semantics (SOS), labelled transition systems are built and composed by using the different operators of the many different process algebras. Behavioral equivalences are used to abstract from unwanted details and identify those systems that react similarly to external experiments

    Static Analysis of Run-Time Errors in Embedded Real-Time Parallel C Programs

    Get PDF
    We present a static analysis by Abstract Interpretation to check for run-time errors in parallel and multi-threaded C programs. Following our work on Astr\'ee, we focus on embedded critical programs without recursion nor dynamic memory allocation, but extend the analysis to a static set of threads communicating implicitly through a shared memory and explicitly using a finite set of mutual exclusion locks, and scheduled according to a real-time scheduling policy and fixed priorities. Our method is thread-modular. It is based on a slightly modified non-parallel analysis that, when analyzing a thread, applies and enriches an abstract set of thread interferences. An iterator then re-analyzes each thread in turn until interferences stabilize. We prove the soundness of our method with respect to the sequential consistency semantics, but also with respect to a reasonable weakly consistent memory semantics. We also show how to take into account mutual exclusion and thread priorities through a partitioning over an abstraction of the scheduler state. We present preliminary experimental results analyzing an industrial program with our prototype, Th\'es\'ee, and demonstrate the scalability of our approach

    Petuum: A New Platform for Distributed Machine Learning on Big Data

    Full text link
    What is a systematic way to efficiently apply a wide spectrum of advanced ML programs to industrial scale problems, using Big Models (up to 100s of billions of parameters) on Big Data (up to terabytes or petabytes)? Modern parallelization strategies employ fine-grained operations and scheduling beyond the classic bulk-synchronous processing paradigm popularized by MapReduce, or even specialized graph-based execution that relies on graph representations of ML programs. The variety of approaches tends to pull systems and algorithms design in different directions, and it remains difficult to find a universal platform applicable to a wide range of ML programs at scale. We propose a general-purpose framework that systematically addresses data- and model-parallel challenges in large-scale ML, by observing that many ML programs are fundamentally optimization-centric and admit error-tolerant, iterative-convergent algorithmic solutions. This presents unique opportunities for an integrative system design, such as bounded-error network synchronization and dynamic scheduling based on ML program structure. We demonstrate the efficacy of these system designs versus well-known implementations of modern ML algorithms, allowing ML programs to run in much less time and at considerably larger model sizes, even on modestly-sized compute clusters.Comment: 15 pages, 10 figures, final version in KDD 2015 under the same titl

    Efficient and Reasonable Object-Oriented Concurrency

    Full text link
    Making threaded programs safe and easy to reason about is one of the chief difficulties in modern programming. This work provides an efficient execution model for SCOOP, a concurrency approach that provides not only data race freedom but also pre/postcondition reasoning guarantees between threads. The extensions we propose influence both the underlying semantics to increase the amount of concurrent execution that is possible, exclude certain classes of deadlocks, and enable greater performance. These extensions are used as the basis an efficient runtime and optimization pass that improve performance 15x over a baseline implementation. This new implementation of SCOOP is also 2x faster than other well-known safe concurrent languages. The measurements are based on both coordination-intensive and data-manipulation-intensive benchmarks designed to offer a mixture of workloads.Comment: Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE '15). ACM, 201

    Reusable, Extensible High-Level Data-Distribution Concept

    Get PDF
    A framework for high-level specification of data distributions in data-parallel application programs has been conceived. [As used here, distributions signifies means to express locality (more specifically, locations of specified pieces of data) in a computing system composed of many processor and memory components connected by a network.] Inasmuch as distributions exert a great effect on the performances of application programs, it is important that a distribution strategy be flexible, so that distributions can be adapted to the requirements of those programs. At the same time, for the sake of productivity in programming and execution, it is desirable that users be shielded from such error-prone, tedious details as those of communication and synchronization. As desired, the present framework enables a user to refine a distribution type and adjust it to optimize the performance of an application program and conceals, from the user, the low-level details of communication and synchronization. The framework provides for a reusable, extensible, data-distribution design, denoted the design pattern, that is independent of a concrete implementation. The design pattern abstracts over coding patterns that have been found to be commonly encountered in both manually and automatically generated distributed parallel programs. The following description of the present framework is necessarily oversimplified to fit within the space available for this article. Distributions are among the elements of a conceptual data-distribution machinery, some of the other elements being denoted domains, index sets, and data collections (see figure). Associated with each domain is one index set and one distribution. A distribution class interface (where "class" is used in the object-oriented-programming sense) includes operations that enable specification of the mapping of an index to a unit of locality. Thus, "Map(Index)" specifies a unit, while "LocalLayout(Index)" specifies the local address within that unit. The distribution class can be extended to enable specification of commonly used distributions or novel user-defined distributions. A data collection can be defined over a domain. The term "data collection" in this context signifies, more specifically, an abstraction of mappings from index sets to variables. Since the index set is distributed, the addresses of the variables are also distributed

    Actors vs Shared Memory: two models at work on Big Data application frameworks

    Full text link
    This work aims at analyzing how two different concurrency models, namely the shared memory model and the actor model, can influence the development of applications that manage huge masses of data, distinctive of Big Data applications. The paper compares the two models by analyzing a couple of concrete projects based on the MapReduce and Bulk Synchronous Parallel algorithmic schemes. Both projects are doubly implemented on two concrete platforms: Akka Cluster and Managed X10. The result is both a conceptual comparison of models in the Big Data Analytics scenario, and an experimental analysis based on concrete executions on a cluster platform
    corecore