5,467 research outputs found
Macroservers: An Execution Model for DRAM Processor-In-Memory Arrays
The emergence of semiconductor fabrication technology allowing a tight coupling between high-density DRAM and CMOS logic on the same chip has led to the important new class of Processor-In-Memory (PIM) architectures. Newer developments provide powerful parallel processing capabilities on the chip, exploiting the facility to load wide words in single memory accesses and supporting complex address manipulations in the memory. Furthermore, large arrays of PIMs can be arranged into a massively parallel architecture. In this report, we describe an object-based programming model based on the notion of a macroserver. Macroservers encapsulate a set of variables and methods; threads, spawned by the activation of methods, operate asynchronously on the variables' state space. Data distributions provide a mechanism for mapping large data structures across the memory region of a macroserver, while work distributions allow explicit control of bindings between threads and data. Both data and work distributuions are first-class objects of the model, supporting the dynamic management of data and threads in memory. This offers the flexibility required for fully exploiting the processing power and memory bandwidth of a PIM array, in particular for irregular and adaptive applications. Thread synchronization is based on atomic methods, condition variables, and futures. A special type of lightweight macroserver allows the formulation of flexible scheduling strategies for the access to resources, using a monitor-like mechanism
Introduction to the ISO specification language LOTOS
LOTOS is a specification language that has been specifically developed for the formal description of the OSI (Open Systems Interconnection) architecture, although it is applicable to distributed, concurrent systems in general. In LOTOS a system is seen as a set of processes which interact and exchange data with each other and with their environment. LOTOS is expected to become an ISO international standard by 1988
Group Communication Patterns for High Performance Computing in Scala
We developed a Functional object-oriented Parallel framework (FooPar) for
high-level high-performance computing in Scala. Central to this framework are
Distributed Memory Parallel Data structures (DPDs), i.e., collections of data
distributed in a shared nothing system together with parallel operations on
these data. In this paper, we first present FooPar's architecture and the idea
of DPDs and group communications. Then, we show how DPDs can be implemented
elegantly and efficiently in Scala based on the Traversable/Builder pattern,
unifying Functional and Object-Oriented Programming. We prove the correctness
and safety of one communication algorithm and show how specification testing
(via ScalaCheck) can be used to bridge the gap between proof and
implementation. Furthermore, we show that the group communication operations of
FooPar outperform those of the MPJ Express open source MPI-bindings for Java,
both asymptotically and empirically. FooPar has already been shown to be
capable of achieving close-to-optimal performance for dense matrix-matrix
multiplication via JNI. In this article, we present results on a parallel
implementation of the Floyd-Warshall algorithm in FooPar, achieving more than
94 % efficiency compared to the serial version on a cluster using 100 cores for
matrices of dimension 38000 x 38000
ChimpCheck: Property-Based Randomized Test Generation for Interactive Apps
We consider the problem of generating relevant execution traces to test rich
interactive applications. Rich interactive applications, such as apps on mobile
platforms, are complex stateful and often distributed systems where
sufficiently exercising the app with user-interaction (UI) event sequences to
expose defects is both hard and time-consuming. In particular, there is a
fundamental tension between brute-force random UI exercising tools, which are
fully-automated but offer low relevance, and UI test scripts, which are manual
but offer high relevance. In this paper, we consider a middle way---enabling a
seamless fusion of scripted and randomized UI testing. This fusion is
prototyped in a testing tool called ChimpCheck for programming, generating, and
executing property-based randomized test cases for Android apps. Our approach
realizes this fusion by offering a high-level, embedded domain-specific
language for defining custom generators of simulated user-interaction event
sequences. What follows is a combinator library built on industrial strength
frameworks for property-based testing (ScalaCheck) and Android testing (Android
JUnit and Espresso) to implement property-based randomized testing for Android
development. Driven by real, reported issues in open source Android apps, we
show, through case studies, how ChimpCheck enables expressing effective testing
patterns in a compact manner.Comment: 20 pages, 21 figures, Symposium on New ideas, New Paradigms, and
Reflections on Programming and Software (Onward!2017
Automatic matching of legacy code to heterogeneous APIs: An idiomatic approach
Heterogeneous accelerators often disappoint. They provide
the prospect of great performance, but only deliver it when
using vendor specific optimized libraries or domain specific
languages. This requires considerable legacy code modifications,
hindering the adoption of heterogeneous computing.
This paper develops a novel approach to automatically
detect opportunities for accelerator exploitation. We focus
on calculations that are well supported by established APIs:
sparse and dense linear algebra, stencil codes and generalized
reductions and histograms. We call them idioms and use a
custom constraint-based Idiom Description Language (IDL)
to discover them within user code. Detected idioms are then
mapped to BLAS libraries, cuSPARSE and clSPARSE and two
DSLs: Halide and Lift.
We implemented the approach in LLVM and evaluated
it on the NAS and Parboil sequential C/C++ benchmarks,
where we detect 60 idiom instances. In those cases where
idioms are a significant part of the sequential execution time,
we generate code that achieves 1.26× to over 20× speedup
on integrated and external GPUs
- …