6,707 research outputs found
Accelerating sequential programs using FastFlow and self-offloading
FastFlow is a programming environment specifically targeting cache-coherent
shared-memory multi-cores. FastFlow is implemented as a stack of C++ template
libraries built on top of lock-free (fence-free) synchronization mechanisms. In
this paper we present a further evolution of FastFlow enabling programmers to
offload part of their workload on a dynamically created software accelerator
running on unused CPUs. The offloaded function can be easily derived from
pre-existing sequential code. We emphasize in particular the effective
trade-off between human productivity and execution efficiency of the approach.Comment: 17 pages + cove
DART-MPI: An MPI-based Implementation of a PGAS Runtime System
A Partitioned Global Address Space (PGAS) approach treats a distributed
system as if the memory were shared on a global level. Given such a global view
on memory, the user may program applications very much like shared memory
systems. This greatly simplifies the tasks of developing parallel applications,
because no explicit communication has to be specified in the program for data
exchange between different computing nodes. In this paper we present DART, a
runtime environment, which implements the PGAS paradigm on large-scale
high-performance computing clusters. A specific feature of our implementation
is the use of one-sided communication of the Message Passing Interface (MPI)
version 3 (i.e. MPI-3) as the underlying communication substrate. We evaluated
the performance of the implementation with several low-level kernels in order
to determine overheads and limitations in comparison to the underlying MPI-3.Comment: 11 pages, International Conference on Partitioned Global Address
Space Programming Models (PGAS14
Process Algebras
Process Algebras are mathematically rigorous languages with well defined semantics that permit describing and verifying properties of concurrent communicating systems.
They can be seen as models of processes, regarded as agents that act and interact continuously with other similar agents and with their common environment. The agents may be real-world objects (even people), or they may be artifacts, embodied perhaps in computer hardware or software systems.
Many different approaches (operational, denotational, algebraic) are taken for describing the meaning of processes. However, the operational approach is the reference one. By relying on the so called Structural Operational Semantics (SOS), labelled transition systems are built and composed by using the different operators of the many different process algebras. Behavioral equivalences are used to abstract from unwanted details and identify those systems that react similarly to external
experiments
Static Analysis of Run-Time Errors in Embedded Real-Time Parallel C Programs
We present a static analysis by Abstract Interpretation to check for run-time
errors in parallel and multi-threaded C programs. Following our work on
Astr\'ee, we focus on embedded critical programs without recursion nor dynamic
memory allocation, but extend the analysis to a static set of threads
communicating implicitly through a shared memory and explicitly using a finite
set of mutual exclusion locks, and scheduled according to a real-time
scheduling policy and fixed priorities. Our method is thread-modular. It is
based on a slightly modified non-parallel analysis that, when analyzing a
thread, applies and enriches an abstract set of thread interferences. An
iterator then re-analyzes each thread in turn until interferences stabilize. We
prove the soundness of our method with respect to the sequential consistency
semantics, but also with respect to a reasonable weakly consistent memory
semantics. We also show how to take into account mutual exclusion and thread
priorities through a partitioning over an abstraction of the scheduler state.
We present preliminary experimental results analyzing an industrial program
with our prototype, Th\'es\'ee, and demonstrate the scalability of our
approach
Petuum: A New Platform for Distributed Machine Learning on Big Data
What is a systematic way to efficiently apply a wide spectrum of advanced ML
programs to industrial scale problems, using Big Models (up to 100s of billions
of parameters) on Big Data (up to terabytes or petabytes)? Modern
parallelization strategies employ fine-grained operations and scheduling beyond
the classic bulk-synchronous processing paradigm popularized by MapReduce, or
even specialized graph-based execution that relies on graph representations of
ML programs. The variety of approaches tends to pull systems and algorithms
design in different directions, and it remains difficult to find a universal
platform applicable to a wide range of ML programs at scale. We propose a
general-purpose framework that systematically addresses data- and
model-parallel challenges in large-scale ML, by observing that many ML programs
are fundamentally optimization-centric and admit error-tolerant,
iterative-convergent algorithmic solutions. This presents unique opportunities
for an integrative system design, such as bounded-error network synchronization
and dynamic scheduling based on ML program structure. We demonstrate the
efficacy of these system designs versus well-known implementations of modern ML
algorithms, allowing ML programs to run in much less time and at considerably
larger model sizes, even on modestly-sized compute clusters.Comment: 15 pages, 10 figures, final version in KDD 2015 under the same titl
Efficient and Reasonable Object-Oriented Concurrency
Making threaded programs safe and easy to reason about is one of the chief
difficulties in modern programming. This work provides an efficient execution
model for SCOOP, a concurrency approach that provides not only data race
freedom but also pre/postcondition reasoning guarantees between threads. The
extensions we propose influence both the underlying semantics to increase the
amount of concurrent execution that is possible, exclude certain classes of
deadlocks, and enable greater performance. These extensions are used as the
basis an efficient runtime and optimization pass that improve performance 15x
over a baseline implementation. This new implementation of SCOOP is also 2x
faster than other well-known safe concurrent languages. The measurements are
based on both coordination-intensive and data-manipulation-intensive benchmarks
designed to offer a mixture of workloads.Comment: Proceedings of the 10th Joint Meeting of the European Software
Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of
Software Engineering (ESEC/FSE '15). ACM, 201
Reusable, Extensible High-Level Data-Distribution Concept
A framework for high-level specification of data distributions in data-parallel application programs has been conceived. [As used here, distributions signifies means to express locality (more specifically, locations of specified pieces of data) in a computing system composed of many processor and memory components connected by a network.] Inasmuch as distributions exert a great effect on the performances of application programs, it is important that a distribution strategy be flexible, so that distributions can be adapted to the requirements of those programs. At the same time, for the sake of productivity in programming and execution, it is desirable that users be shielded from such error-prone, tedious details as those of communication and synchronization. As desired, the present framework enables a user to refine a distribution type and adjust it to optimize the performance of an application program and conceals, from the user, the low-level details of communication and synchronization. The framework provides for a reusable, extensible, data-distribution design, denoted the design pattern, that is independent of a concrete implementation. The design pattern abstracts over coding patterns that have been found to be commonly encountered in both manually and automatically generated distributed parallel programs. The following description of the present framework is necessarily oversimplified to fit within the space available for this article. Distributions are among the elements of a conceptual data-distribution machinery, some of the other elements being denoted domains, index sets, and data collections (see figure). Associated with each domain is one index set and one distribution. A distribution class interface (where "class" is used in the object-oriented-programming sense) includes operations that enable specification of the mapping of an index to a unit of locality. Thus, "Map(Index)" specifies a unit, while "LocalLayout(Index)" specifies the local address within that unit. The distribution class can be extended to enable specification of commonly used distributions or novel user-defined distributions. A data collection can be defined over a domain. The term "data collection" in this context signifies, more specifically, an abstraction of mappings from index sets to variables. Since the index set is distributed, the addresses of the variables are also distributed
Actors vs Shared Memory: two models at work on Big Data application frameworks
This work aims at analyzing how two different concurrency models, namely the
shared memory model and the actor model, can influence the development of
applications that manage huge masses of data, distinctive of Big Data
applications. The paper compares the two models by analyzing a couple of
concrete projects based on the MapReduce and Bulk Synchronous Parallel
algorithmic schemes. Both projects are doubly implemented on two concrete
platforms: Akka Cluster and Managed X10. The result is both a conceptual
comparison of models in the Big Data Analytics scenario, and an experimental
analysis based on concrete executions on a cluster platform
- …