13,020 research outputs found
Modularizing and Specifying Protocols among Threads
We identify three problems with current techniques for implementing protocols
among threads, which complicate and impair the scalability of multicore
software development: implementing synchronization, implementing coordination,
and modularizing protocols. To mend these deficiencies, we argue for the use of
domain-specific languages (DSL) based on existing models of concurrency. To
demonstrate the feasibility of this proposal, we explain how to use the model
of concurrency Reo as a high-level protocol DSL, which offers appropriate
abstractions and a natural separation of protocols and computations. We
describe a Reo-to-Java compiler and illustrate its use through examples.Comment: In Proceedings PLACES 2012, arXiv:1302.579
Compiling vector pascal to the XeonPhi
Intel's XeonPhi is a highly parallel x86 architecture chip made by Intel. It has a number of novel features which make it a particularly challenging target for the compiler writer. This paper describes the techniques used to port the Glasgow Vector Pascal Compiler to this architecture and assess its performance by comparisons of the XeonPhi with 3 other machines running the same algorithms
Remote-scope Promotion: Clarified, Rectified, and Verified
Modern accelerator programming frameworks, such as OpenCL, organise threads into work-groups. Remote-scope promotion (RSP) is a language extension recently proposed by AMD researchers that is designed to enable applications, for the first time, both to optimise for the common case of intra-work-group communication (using memory scopes to provide consistency only within a work-group) and to allow occasional inter-work-group communication (as required, for instance, to support the popular load-balancing idiom of work stealing). We present the first formal, axiomatic memory model of OpenCL extended with RSP. We have extended the Herd memory model simulator with support for OpenCL kernels that exploit RSP, and used it to discover bugs in several litmus tests and a work-stealing queue, that have been used previously in the study of RSP. We have also formalised the proposed GPU implementation of RSP. The formalisation process allowed us to identify bugs in the description of RSP that could result in well-synchronised programs experiencing memory inconsistencies. We present and prove sound a new implementation of RSP that incorporates bug fixes and requires less non-standard hardware than the original implementation. This work, a collaboration between academia and industry, clearly demonstrates how, when designing hardware support for a new concurrent language feature, the early application of formal tools and techniques can help to prevent errors, such as those we have found, from making it into silicon
Using shared-data localization to reduce the cost of inspector-execution in unified-parallel-C programs
Programs written in the Unified Parallel C (UPC) language can access any location of the entire local and remote address space via read/write operations. However, UPC programs that contain fine-grained shared accesses can exhibit performance degradation. One solution is to use the inspector-executor technique to coalesce fine-grained shared accesses to larger remote access operations. A straightforward implementation of the inspector executor transformation results in excessive instrumentation that hinders performance.; This paper addresses this issue and introduces various techniques that aim at reducing the generated instrumentation code: a shared-data localization transformation based on Constant-Stride Linear Memory Descriptors (CSLMADs) [S. Aarseth, Gravitational N-Body Simulations: Tools and Algorithms, Cambridge Monographs on Mathematical Physics, Cambridge University Press, 2003.], the inlining of data locality checks and the usage of an index vector to aggregate the data. Finally, the paper introduces a lightweight loop code motion transformation to privatize shared scalars that were propagated through the loop body.; A performance evaluation, using up to 2048 cores of a POWER 775, explores the impact of each optimization and characterizes the overheads of UPC programs. It also shows that the presented optimizations increase performance of UPC programs up to 1.8 x their UPC hand-optimized counterpart for applications with regular accesses and up to 6.3 x for applications with irregular accesses.Peer ReviewedPostprint (author's final draft
Continuation-Passing C: compiling threads to events through continuations
In this paper, we introduce Continuation Passing C (CPC), a programming
language for concurrent systems in which native and cooperative threads are
unified and presented to the programmer as a single abstraction. The CPC
compiler uses a compilation technique, based on the CPS transform, that yields
efficient code and an extremely lightweight representation for contexts. We
provide a proof of the correctness of our compilation scheme. We show in
particular that lambda-lifting, a common compilation technique for functional
languages, is also correct in an imperative language like C, under some
conditions enforced by the CPC compiler. The current CPC compiler is mature
enough to write substantial programs such as Hekate, a highly concurrent
BitTorrent seeder. Our benchmark results show that CPC is as efficient, while
using significantly less space, as the most efficient thread libraries
available.Comment: Higher-Order and Symbolic Computation (2012). arXiv admin note:
substantial text overlap with arXiv:1202.324
- …