48,733 research outputs found
DART-MPI: An MPI-based Implementation of a PGAS Runtime System
A Partitioned Global Address Space (PGAS) approach treats a distributed
system as if the memory were shared on a global level. Given such a global view
on memory, the user may program applications very much like shared memory
systems. This greatly simplifies the tasks of developing parallel applications,
because no explicit communication has to be specified in the program for data
exchange between different computing nodes. In this paper we present DART, a
runtime environment, which implements the PGAS paradigm on large-scale
high-performance computing clusters. A specific feature of our implementation
is the use of one-sided communication of the Message Passing Interface (MPI)
version 3 (i.e. MPI-3) as the underlying communication substrate. We evaluated
the performance of the implementation with several low-level kernels in order
to determine overheads and limitations in comparison to the underlying MPI-3.Comment: 11 pages, International Conference on Partitioned Global Address
Space Programming Models (PGAS14
Bridging the Gap between Enumerative and Symbolic Model Checkers
We present a method to perform symbolic state space generation for languages with existing enumerative state generators. The method is largely independent from the chosen modelling language. We validated this on three different types of languages and tools: state-based languages (PROMELA), action-based process algebras (muCRL, mCRL2), and discrete abstractions of ODEs (Maple).\ud
Only little information about the combinatorial structure of the\ud
underlying model checking problem need to be provided. The key enabling data structure is the "PINS" dependency matrix. Moreover, it can be provided gradually (more precise information yield better results).\ud
\ud
Second, in addition to symbolic reachability, the same PINS matrix contains enough information to enable new optimizations in state space generation (transition caching), again independent from the chosen modelling language. We have also based existing optimizations, like (recursive) state collapsing, on top of PINS and hint at how to support partial order reduction techniques.\ud
\ud
Third, PINS allows interfacing of existing state generators to, e.g., distributed reachability tools. Thus, besides the stated novelties, the method we propose also significantly reduces the complexity of building modular yet still efficient model checking tools.\ud
\ud
Our experiments show that we can match or even outperform existing tools by reusing their own state generators, which we have linked into an implementation of our ideas
A Theory of Partitioned Global Address Spaces
Partitioned global address space (PGAS) is a parallel programming model for
the development of applications on clusters. It provides a global address space
partitioned among the cluster nodes, and is supported in programming languages
like C, C++, and Fortran by means of APIs. In this paper we provide a formal
model for the semantics of single instruction, multiple data programs using
PGAS APIs. Our model reflects the main features of popular real-world APIs such
as SHMEM, ARMCI, GASNet, GPI, and GASPI.
A key feature of PGAS is the support for one-sided communication: a node may
directly read and write the memory located at a remote node, without explicit
synchronization with the processes running on the remote side. One-sided
communication increases performance by decoupling process synchronization from
data transfer, but requires the programmer to reason about appropriate
synchronizations between reads and writes. As a second contribution, we propose
and investigate robustness, a criterion for correct synchronization of PGAS
programs. Robustness corresponds to acyclicity of a suitable happens-before
relation defined on PGAS computations. The requirement is finer than the
classical data race freedom and rules out most false error reports.
Our main result is an algorithm for checking robustness of PGAS programs. The
algorithm makes use of two insights. Using combinatorial arguments we first
show that, if a PGAS program is not robust, then there are computations in a
certain normal form that violate happens-before acyclicity. Intuitively,
normal-form computations delay remote accesses in an ordered way. We then
devise an algorithm that checks for cyclic normal-form computations.
Essentially, the algorithm is an emptiness check for a novel automaton model
that accepts normal-form computations in streaming fashion. Altogether, we
prove the robustness problem is PSpace-complete
Execution models for mapping programs onto distributed memory parallel computers
The problem of exploiting the parallelism available in a program to efficiently employ the resources of the target machine is addressed. The problem is discussed in the context of building a mapping compiler for a distributed memory parallel machine. The paper describes using execution models to drive the process of mapping a program in the most efficient way onto a particular machine. Through analysis of the execution models for several mapping techniques for one class of programs, we show that the selection of the best technique for a particular program instance can make a significant difference in performance. On the other hand, the results of benchmarks from an implementation of a mapping compiler show that our execution models are accurate enough to select the best mapping technique for a given program
Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems
In modern Commercial Off-The-Shelf (COTS) multicore systems, each core can
generate many parallel memory requests at a time. The processing of these
parallel requests in the DRAM controller greatly affects the memory
interference delay experienced by running tasks on the platform. In this paper,
we model a modern COTS multicore system which has a nonblocking last-level
cache (LLC) and a DRAM controller that prioritizes reads over writes. To
minimize interference, we focus on LLC and DRAM bank partitioned systems. Based
on the model, we propose an analysis that computes a safe upper bound for the
worst-case memory interference delay. We validated our analysis on a real COTS
multicore platform with a set of carefully designed synthetic benchmarks as
well as SPEC2006 benchmarks. Evaluation results show that our analysis is more
accurately capture the worst-case memory interference delay and provides safer
upper bounds compared to a recently proposed analysis which significantly
under-estimate the delay.Comment: Technical Repor
AutoAccel: Automated Accelerator Generation and Optimization with Composable, Parallel and Pipeline Architecture
CPU-FPGA heterogeneous architectures are attracting ever-increasing attention
in an attempt to advance computational capabilities and energy efficiency in
today's datacenters. These architectures provide programmers with the ability
to reprogram the FPGAs for flexible acceleration of many workloads.
Nonetheless, this advantage is often overshadowed by the poor programmability
of FPGAs whose programming is conventionally a RTL design practice. Although
recent advances in high-level synthesis (HLS) significantly improve the FPGA
programmability, it still leaves programmers facing the challenge of
identifying the optimal design configuration in a tremendous design space.
This paper aims to address this challenge and pave the path from software
programs towards high-quality FPGA accelerators. Specifically, we first propose
the composable, parallel and pipeline (CPP) microarchitecture as a template of
accelerator designs. Such a well-defined template is able to support efficient
accelerator designs for a broad class of computation kernels, and more
importantly, drastically reduce the design space. Also, we introduce an
analytical model to capture the performance and resource trade-offs among
different design configurations of the CPP microarchitecture, which lays the
foundation for fast design space exploration. On top of the CPP
microarchitecture and its analytical model, we develop the AutoAccel framework
to make the entire accelerator generation automated. AutoAccel accepts a
software program as an input and performs a series of code transformations
based on the result of the analytical-model-based design space exploration to
construct the desired CPP microarchitecture. Our experiments show that the
AutoAccel-generated accelerators outperform their corresponding software
implementations by an average of 72x for a broad class of computation kernels
- …