Adaptive structured parallelism for computational grids
Algorithmic skeletons abstract commonly-used patterns of parallel computation, communication, and interaction. They provide top-down design composition and control inheritance throughout the whole structure. Parallel programs are expressed by interweaving parameterised skeletons analogously to the way sequential structured programs are constructed.
This design paradigm, known as structured parallelism, provides a high-level parallel programming method which allows the abstract description of programs and fosters portability. That is to say, structured parallelism requires a description of the algorithm rather than of its implementation: the description carries a clear and consistent meaning across platforms, while the associated parallel structure depends on the particular implementation. Because the structure is thus decoupled from the meaning of a parallel program, the program benefits directly from any performance improvements in the underlying systems infrastructure.
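To make the skeleton idea concrete, here is a minimal Python sketch (our illustration, not code from the paper; the skeleton names are invented): each skeleton captures a pattern of parallel computation and is parameterised by sequential "muscle" functions, so programs are built by composing skeletons rather than by writing explicit coordination code.

```python
from functools import reduce

def seq_map(f):
    """Map skeleton: apply f independently to every element (data parallel)."""
    return lambda xs: [f(x) for x in xs]

def seq_reduce(op, init):
    """Reduce skeleton: combine all elements with an associative operator."""
    return lambda xs: reduce(op, xs, init)

def pipeline(*stages):
    """Pipeline skeleton: feed the output of each stage into the next."""
    def run(xs):
        for stage in stages:
            xs = stage(xs)
        return xs
    return run

# A structured program: square every element, then sum the results.
# Only the pattern is fixed; an implementation is free to run each
# skeleton in parallel on whatever platform it targets.
sum_of_squares = pipeline(seq_map(lambda x: x * x),
                          seq_reduce(lambda a, b: a + b, 0))

print(sum_of_squares(range(1, 5)))  # 1 + 4 + 9 + 16 = 30
```

The sequential semantics above define the meaning; a parallel implementation of the same skeletons may change the structure (e.g. distribute the map) without changing that meaning.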
pocl: A Performance-Portable OpenCL Implementation
OpenCL is a standard for parallel programming of heterogeneous systems. The
benefits of a common programming standard are clear; multiple vendors can
provide support for application descriptions written according to the standard,
thus reducing the program porting effort. While the standard brings the obvious
benefits of platform portability, the performance portability aspects are
largely left to the programmer. The situation is made worse by multiple
proprietary vendor implementations with different characteristics and, thus,
different required optimization strategies.
In this paper, we propose an OpenCL implementation that is both portable and
performance portable. At its core is a kernel compiler that can be used to
exploit the data parallelism of OpenCL programs on multiple platforms with
different parallel hardware styles. The kernel compiler is modularized to
perform target-independent parallel region formation separately from the
target-specific parallel mapping of the regions to enable support for various
styles of fine-grained parallel resources such as subword SIMD extensions, SIMD
datapaths and static multi-issue. Unlike previous similar techniques that work
on the source level, the parallel region formation retains the information of
the data parallelism using the LLVM IR and its metadata infrastructure. This
data can be exploited by the later generic compiler passes for efficient
parallelization.
The proposed open source implementation of OpenCL is also platform portable,
enabling OpenCL on a wide range of architectures, both already commercialized
and on those that are still under research. The paper describes how the
portability of the implementation is achieved. Our results show that most of
the benchmarked applications, when compiled using pocl, were faster than or
close to as fast as the best proprietary OpenCL implementation for the
platform at hand.
Comment: This article was published in 2015; it is now openly accessible via arXiv.
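The kernel-compiler idea of turning a per-work-item kernel into serial "work-item loops" over a whole work-group can be sketched in Python (our illustration only; pocl performs this on LLVM IR, not at this level):

```python
# Sketch of work-group compilation: a data-parallel kernel written in
# per-work-item style is executed by looping over the work-items of each
# work-group. Every iteration of the inner loop is independent, so a
# target-specific pass may map it to SIMD lanes, VLIW slots, or scalar code.

def run_ndrange(kernel, global_size, local_size, *args):
    """Execute kernel(gid, *args) over a 1-D NDRange (sizes are illustrative)."""
    assert global_size % local_size == 0
    for group in range(global_size // local_size):
        for lid in range(local_size):        # work-item loop (vectorizable)
            gid = group * local_size + lid
            kernel(gid, *args)

# A trivial vector-add "kernel" written in per-work-item style.
def vec_add(gid, a, b, c):
    c[gid] = a[gid] + b[gid]

a = list(range(8))
b = [10] * 8
c = [0] * 8
run_ndrange(vec_add, 8, 4, a, b, c)
print(c)  # [10, 11, 12, 13, 14, 15, 16, 17]
```

Keeping the inner loop explicit, as pocl keeps the data-parallel regions explicit in IR metadata, is what lets later passes choose a mapping per target.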
Loo.py: From Fortran to performance via transformation and substitution rules
A large amount of numerically-oriented code has been, and continues to be,
written in legacy languages. Much of this code could, in principle, make good use of
data-parallel throughput-oriented computer architectures. Loo.py, a
transformation-based programming system targeted at GPUs and general
data-parallel architectures, provides a mechanism for user-controlled
transformation of array programs. This transformation capability is designed to
not just apply to programs written specifically for Loo.py, but also those
imported from other languages such as Fortran. It eases the trade-off between
achieving high performance, portability, and programmability by allowing the
user to apply a large and growing family of transformations to an input
program. These transformations are expressed in and used from Python and may be
applied from a variety of settings, including a pragma-like manner from other
languages.
Comment: ARRAY 2015 - 2nd ACM SIGPLAN International Workshop on Libraries, Languages and Compilers for Array Programming (ARRAY 2015).
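A toy model of user-controlled transformation of array programs, in the spirit of (but far simpler than) Loo.py, might look like this; the representation and names are ours:

```python
# The schedule of a loop program is treated as data; a "split" rule
# rewrites it, and correctness means both schedules visit the same work.

def original_schedule(n):
    """The untransformed loop: plain i = 0 .. n-1."""
    return list(range(n))

def split_schedule(n, tile):
    """After split(i, tile): i = i_outer*tile + i_inner.

    The i_inner loop is now a candidate for mapping to GPU threads or SIMD.
    """
    assert n % tile == 0, "tail handling is omitted in this sketch"
    return [io * tile + ii
            for io in range(n // tile)
            for ii in range(tile)]

def run(schedule, a):
    """Execute the array program out[i] = 2*a[i] under a given schedule."""
    out = [0] * len(a)
    for i in schedule:
        out[i] = 2 * a[i]
    return out

a = list(range(16))
# Transformation preserves semantics: same result under either schedule.
assert run(original_schedule(16), a) == run(split_schedule(16, 4), a)
print(run(split_schedule(16, 4), a)[:4])  # [0, 2, 4, 6]
```

Loo.py's transformations operate on a much richer polyhedral representation, but the contract is the same: the user picks transformations, and the system guarantees the rewritten program computes the same values.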
Towards an Adaptive Skeleton Framework for Performance Portability
The proliferation of widely available, but very different, parallel architectures
makes the ability to deliver good parallel performance
on a range of architectures, or performance portability, highly desirable.
Irregularly-parallel problems, where the number and size
of tasks are unpredictable, are particularly challenging and require
dynamic coordination.
The paper outlines a novel approach to delivering portable parallel
performance for irregularly parallel programs. The approach
combines declarative parallelism with JIT technology, dynamic
scheduling, and dynamic transformation.
We present the design of an adaptive skeleton library, with a task
graph implementation, JIT trace costing, and adaptive transformations.
We outline the architecture of the prototype adaptive skeleton
execution framework in Pycket, describing tasks, serialisation,
and the current scheduler. We report a preliminary evaluation of the
prototype framework using 4 micro-benchmarks and a small case
study on two NUMA servers (24 and 96 cores) and a small cluster
(17 hosts, 272 cores). Key results include Pycket delivering good
sequential performance, e.g. almost as fast as C for some benchmarks;
good absolute speedups on all architectures (up to 120 on
128 cores for sumEuler); and that the adaptive transformations do
improve performance.
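One adaptive transformation of this kind, fusing tiny tasks into larger chunks when per-task scheduling overhead would dominate, can be sketched as follows (our illustration; the framework's actual JIT trace costing is more sophisticated than a fixed cost estimate):

```python
# Chunking as an adaptive transformation: given a cost estimate per item
# (in the paper's setting, derived from JIT trace costing), group items so
# that each task's estimated work dominates the scheduling overhead.

def chunked_tasks(items, est_cost, overhead=1.0):
    """Group items into tasks whose estimated cost exceeds the overhead."""
    chunk, chunks, acc = [], [], 0.0
    for x in items:
        chunk.append(x)
        acc += est_cost(x)
        if acc >= overhead:
            chunks.append(chunk)
            chunk, acc = [], 0.0
    if chunk:                      # leftover partial chunk
        chunks.append(chunk)
    return chunks

# 8 cheap items (estimated cost 0.3 each) against overhead 1.0
# fuse into two tasks of four items each.
tasks = chunked_tasks(range(8), lambda _: 0.3)
print(tasks)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

With a dynamic cost model, the same mechanism adapts the task granularity per architecture, which is one way the adaptive transformations can improve performance on irregular workloads.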
DPOS: A metalanguage and programming environment for parallel processors
The complexity and diversity of parallel programming languages and computer architectures hinder programmers in developing programs and greatly limit program portability. All MIMD parallel programming systems, however, address common requirements for process creation, process management, and interprocess communication. This paper describes and illustrates a structured programming system (DPOS) and graphical programming environment for generating and debugging high-level MIMD parallel programs. DPOS is a metalanguage for defining parallel program networks based on the common requirements of distributed parallel computing that is portable across languages, modular, and highly flexible. The system uses the concept of stratification to separate process network creation and the control of parallelism from computational work. Individual processes are defined within the process object layer as traditional single-threaded programs without parallel language constructs. Process networks and communication are defined graphically within the system layer at a high level of abstraction as recursive graphs. Communication is facilitated in DPOS by extending message-passing semantics in several ways to implement highly flexible message-passing constructs. DPOS processes exchange messages through bi-directional channel objects using guarded, buffered, synchronous, and asynchronous communication semantics. The DPOS environment also generates source code and provides a simulation system for graphical debugging and animation of the programs in graph form.
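A minimal Python sketch (ours, not DPOS code) of a bi-directional channel object with buffered asynchronous sends between a main process and an attached worker process:

```python
import queue
import threading

# Two buffered queues give asynchronous sends in both directions; a
# rendezvous protocol (send then wait for an acknowledgement) could model
# the synchronous mode, and a predicate on messages could model guards.

class Channel:
    def __init__(self, capacity=8):
        self._a2b = queue.Queue(capacity)  # put() blocks only when the buffer is full
        self._b2a = queue.Queue(capacity)

    def end_a(self):
        return self._a2b.put, self._b2a.get

    def end_b(self):
        return self._b2a.put, self._a2b.get

ch = Channel()
send, recv = ch.end_a()
peer_send, peer_recv = ch.end_b()

def worker():
    # The attached process: receive a request, send back a reply.
    x = peer_recv()
    peer_send(x * x)

t = threading.Thread(target=worker)
t.start()
send(7)            # asynchronous: returns as soon as the message is buffered
result = recv()    # blocks until the reply arrives
t.join()
print(result)  # 49
```

Each single-threaded body (here, `worker`) contains no parallel constructs; all coordination lives in the channel objects, mirroring DPOS's stratification.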
A Critique of the llc Parallel Language and Some Solutions
llc is an extension of C that has been implemented on the Dado2 machine at Columbia University. In an llc program, a single controlling processor invokes operations in parallel in subsets of a set of attached processors, which themselves can invoke parallel operations in remaining processors. llc allocates one element of a parallel object per physical processor. Removing this restriction allows programs to use parallel vectors of arbitrary size without reference to the number of processors in the machine. A program in the resulting language, mpc, contains a single main process. Each mpc process can create sets of attached processes statically or dynamically by declaring arrays of process type, and can invoke operations in parallel in these processes. mpc retains much of llc's power while adding generality, clarity, and portability.
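The core mpc idea, decoupling the size of a parallel vector from the number of physical processors, amounts to a distribution function; a sketch (ours, assuming a simple block distribution, which mpc itself need not use):

```python
# Map a parallel vector of arbitrary size n onto p physical processors.
# Under llc's one-element-per-processor rule this mapping is the identity
# and n <= p is forced; a distribution function removes that restriction.

def block_owner(i, n, p):
    """Processor that owns element i under a block distribution."""
    block = -(-n // p)          # ceil(n / p) elements per processor
    return i // block

# 10 elements on 4 processors: blocks of 3, last processor gets the rest.
owners = [block_owner(i, 10, 4) for i in range(10)]
print(owners)  # [0, 0, 0, 1, 1, 1, 2, 2, 2, 3]
```

Because programs index the logical vector and the mapping is computed, the same mpc program runs unchanged on machines with different processor counts.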
Source-to-source compilation of loop programs for manycore processors
It is widely accepted today that the end of microprocessor performance growth
based on increasing clock speeds and instruction-level parallelism (ILP)
demands new ways of exploiting transistor densities.
Manycore processors (most commonly known as
GPGPUs or simply GPUs) provide a viable solution to this performance
scaling bottleneck through large numbers of lightweight compute cores
and memory hierarchies that rely primarily on software for their
efficient utilization. The widespread proliferation of this class of
architectures today is a clear indication that exposing and managing
parallelism on a large scale as well as efficiently orchestrating
on-chip data movement is becoming an increasingly critical concern for
high-performance software development. In such a computing landscape
performance portability -- the ability to exploit the power of a variety
of manycore chips while minimizing the impact on software development
and productivity -- is perhaps one of the most important and challenging
objectives for our research community.
This thesis is about
performance portability for manycore processors and how source-to-source
compilation can help us achieve it. In particular, we show that for an
important set of loop-programs, performance portability is
attainable at low cost through compile-time polyhedral analysis and optimization
and parametric tiling for run-time performance
tuning. In other words, we propose and evaluate a source-to-source
compilation path that takes affine loop-programs as input and
produces parametrically tiled parallel code amenable to run-time tuning
across different manycore platforms and devices -- a very useful
and powerful property if we seek performance portability because it
decouples the compiler from the performance tuning process. The produced
code relies on a platform-independent run-time environment, called Avelas,
that allows us to formulate a robust and portable code generation algorithm.
Our experimental evaluation shows that Avelas induces low run-time overhead
and even substantial speed-ups for wavefront-parallel programs compared to a state-of-the-art
compile-time scheme with no run-time support. We also claim that the low overhead of Avelas is a strong
indication that it can also be effective as a general-purpose programming model
for manycore processors as we demonstrate for a set of ParBoil benchmarks.Open Acces
In this thesis, we address issues associated with programming modern heterogeneous systems, focusing on a particular kind of heterogeneous system that combines multicore CPUs with one or more GPUs, called GPU-based systems. We consider the skeleton programming approach to achieve high-level abstraction for efficient and portable programming of these GPU-based systems and present our work on SkePU, a skeleton library for these systems. We extend the existing SkePU library with a two-dimensional (2D) data type and skeleton operations and implement several new applications using the new skeletons. Furthermore, we consider the algorithmic choice present in SkePU and implement support to specify and automatically optimize the algorithmic choice for a skeleton call on a given platform. To show how to achieve performance, we provide a case study on an optimized GPU-based skeleton implementation for 2D stencil computations and introduce two metrics to maximize resource utilization on a GPU. By devising a mechanism to automatically calculate these two metrics, performance can be retained while porting an application from one GPU architecture to another.
Another contribution of this thesis is the implementation of runtime support for the SkePU skeleton library, achieved with the help of the StarPU runtime system. This implementation provides dynamic scheduling and load balancing for SkePU skeleton programs. Furthermore, a capability for hybrid execution, i.e. parallel execution on all available CPUs and GPUs in a system even for a single skeleton invocation, is developed. SkePU initially supported only data-parallel skeletons. The first task-parallel skeleton (farm) in SkePU is implemented with support for performance-aware scheduling and hierarchical parallel execution, enabling all data-parallel skeletons to be usable as tasks inside the farm construct.
Experimental evaluations are carried out and presented for the algorithmic selection, performance portability, dynamic scheduling, and hybrid execution aspects of our work.
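The algorithmic-choice mechanism can be sketched in Python (our illustration; SkePU itself is a C++ library and its selection is driven by empirical tuning, not by the fixed threshold assumed here):

```python
# A skeleton call that selects an implementation variant per invocation.
# Backend names and the size-based heuristic are our assumptions for the
# sketch; the real library measures and tunes this choice per platform.

def map_cpu(f, xs):
    """Sequential CPU variant of the map skeleton."""
    return [f(x) for x in xs]

def map_gpu_stub(f, xs):
    """Stand-in for a GPU backend; a real library would offload here."""
    return [f(x) for x in xs]

BACKENDS = {"cpu": map_cpu, "gpu": map_gpu_stub}

def smart_map(f, xs, size_threshold=10_000):
    """Pick a backend from the problem size; the user code never changes."""
    backend = "gpu" if len(xs) >= size_threshold else "cpu"
    return BACKENDS[backend](f, xs)

print(smart_map(lambda x: x + 1, [1, 2, 3]))  # small input -> CPU backend
```

Because the call site stays the same whichever backend wins, the selection logic (and any tuning data behind it) can evolve without touching application code, which is the portability payoff of the skeleton interface.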