461 research outputs found
A Theory of Partitioned Global Address Spaces
Partitioned global address space (PGAS) is a parallel programming model for
the development of applications on clusters. It provides a global address space
partitioned among the cluster nodes, and is supported in programming languages
like C, C++, and Fortran by means of APIs. In this paper we provide a formal
model for the semantics of single instruction, multiple data programs using
PGAS APIs. Our model reflects the main features of popular real-world APIs such
as SHMEM, ARMCI, GASNet, GPI, and GASPI.
A key feature of PGAS is the support for one-sided communication: a node may
directly read and write the memory located at a remote node, without explicit
synchronization with the processes running on the remote side. One-sided
communication increases performance by decoupling process synchronization from
data transfer, but requires the programmer to reason about appropriate
synchronizations between reads and writes. As a second contribution, we propose
and investigate robustness, a criterion for correct synchronization of PGAS
programs. Robustness corresponds to acyclicity of a suitable happens-before
relation defined on PGAS computations. The requirement is finer than the
classical data race freedom and rules out most false error reports.
Our main result is an algorithm for checking robustness of PGAS programs. The
algorithm makes use of two insights. Using combinatorial arguments we first
show that, if a PGAS program is not robust, then there are computations in a
certain normal form that violate happens-before acyclicity. Intuitively,
normal-form computations delay remote accesses in an ordered way. We then
devise an algorithm that checks for cyclic normal-form computations.
Essentially, the algorithm is an emptiness check for a novel automaton model
that accepts normal-form computations in streaming fashion. Altogether, we
prove the robustness problem is PSpace-complete
Locality and Singularity for Store-Atomic Memory Models
Robustness is a correctness notion for concurrent programs running under
relaxed consistency models. The task is to check that the relaxed behavior
coincides (up to traces) with sequential consistency (SC). Although
computationally simple on paper (robustness has been shown to be
PSPACE-complete for TSO, PGAS, and Power), building a practical robustness
checker remains a challenge. The problem is that the various relaxations lead
to a dramatic number of computations, only few of which violate robustness.
In the present paper, we set out to reduce the search space for robustness
checkers. We focus on store-atomic consistency models and establish two
completeness results. The first result, called locality, states that a
non-robust program always contains a violating computation where only one
thread delays commands. The second result, called singularity, is even stronger
but restricted to programs without lightweight fences. It states that there is
a violating computation where a single store is delayed.
As an application of the results, we derive a linear-size source-to-source
translation of robustness to SC-reachability. It applies to general programs,
regardless of the data domain and potentially with an unbounded number of
threads and with unbounded buffers. We have implemented the translation and
verified, for the first time, PGAS algorithms in a fully automated fashion. For
TSO, our analysis outperforms existing tools
Group Communication Patterns for High Performance Computing in Scala
We developed a Functional object-oriented Parallel framework (FooPar) for
high-level high-performance computing in Scala. Central to this framework are
Distributed Memory Parallel Data structures (DPDs), i.e., collections of data
distributed in a shared nothing system together with parallel operations on
these data. In this paper, we first present FooPar's architecture and the idea
of DPDs and group communications. Then, we show how DPDs can be implemented
elegantly and efficiently in Scala based on the Traversable/Builder pattern,
unifying Functional and Object-Oriented Programming. We prove the correctness
and safety of one communication algorithm and show how specification testing
(via ScalaCheck) can be used to bridge the gap between proof and
implementation. Furthermore, we show that the group communication operations of
FooPar outperform those of the MPJ Express open source MPI-bindings for Java,
both asymptotically and empirically. FooPar has already been shown to be
capable of achieving close-to-optimal performance for dense matrix-matrix
multiplication via JNI. In this article, we present results on a parallel
implementation of the Floyd-Warshall algorithm in FooPar, achieving more than
94 % efficiency compared to the serial version on a cluster using 100 cores for
matrices of dimension 38000 x 38000
An OpenSHMEM Implementation for the Adapteva Epiphany Coprocessor
This paper reports the implementation and performance evaluation of the
OpenSHMEM 1.3 specification for the Adapteva Epiphany architecture within the
Parallella single-board computer. The Epiphany architecture exhibits massive
many-core scalability with a physically compact 2D array of RISC CPU cores and
a fast network-on-chip (NoC). While fully capable of MPMD execution, the
physical topology and memory-mapped capabilities of the core and network
translate well to Partitioned Global Address Space (PGAS) programming models
and SPMD execution with SHMEM.Comment: 14 pages, 9 figures, OpenSHMEM 2016: Third workshop on OpenSHMEM and
Related Technologie
Playing Games in the Baire Space
We solve a generalized version of Church's Synthesis Problem where a play is
given by a sequence of natural numbers rather than a sequence of bits; so a
play is an element of the Baire space rather than of the Cantor space. Two
players Input and Output choose natural numbers in alternation to generate a
play. We present a natural model of automata ("N-memory automata") equipped
with the parity acceptance condition, and we introduce also the corresponding
model of "N-memory transducers". We show that solvability of games specified by
N-memory automata (i.e., existence of a winning strategy for player Output) is
decidable, and that in this case an N-memory transducer can be constructed that
implements a winning strategy for player Output.Comment: In Proceedings Cassting'16/SynCoP'16, arXiv:1608.0017
Improving Data Locality in Distributed Processing of Multi-Channel Remote Sensing Data with Potentially Large Stencils
Distributing a multi-channel remote sensing data processing with potentially large stencils
is a difficult challenge. The goal of this master thesis was to evaluate and investigate the
performance impacts of such a processing on a distributed system and if it is possible to
improve the total execution time by exploiting data locality or memory alignments. The
thesis also gives a brief overview of the actual state of the art in remote sensing distributed
data processing and points out why distributed computing will become more important for
it in the future. For the experimental part of this thesis an application to process huge
arrays on a distributed system was implemented with DASH, a C++ Template Library for
Distributed Data Structures with Support for Hierarchical Locality for High Performance
Computing and Data-Driven Science. On the basis of the first results an optimization model
was developed which has the goal to reduce network traffic while initializing a distributed
data structure and executing computations on it with potentially large stencils. Furthermore,
a software to estimate the memory layouts with the least network communication cost for a
given multi-channel remote sensing data processing workflow was implemented. The results
of this optimization were executed and evaluated afterwards. The results show that it is
possible to improve the initialization speed of a large image by considering the brick locality
by 25%. The optimization model also generate valid decisions for the initialization of the
PGAS memory layouts. However, for a real implementation the optimization model has to
be modified to reflect implementation-dependent sources of overhead. This thesis presented
some approaches towards solving challenges of the distributed computing world that can be
used for real-world remote sensing imaging applications and contributed towards solving the
challenges of the modern Big Data world for future scientific data exploitation
- …