The Glasgow Parallel Reduction Machine: Programming Shared-memory Many-core Systems using Parallel Task Composition
We present the Glasgow Parallel Reduction Machine (GPRM), a novel, flexible framework for many-core programming based on parallel task composition. We allow
the programmer to structure programs into task code, written as C++ classes,
and communication code, written in a restricted subset of C++ with functional
semantics and parallel evaluation. In this paper we discuss the GPRM, the
virtual machine framework that enables the parallel task composition approach.
We focus the discussion on GPIR, the functional language used as the
intermediate representation of the bytecode running on the GPRM. Using examples
in this language we show the flexibility and power of our task composition
framework. We demonstrate the potential using an implementation of a merge sort
algorithm on a 64-core Tilera processor, as well as on a conventional Intel
quad-core processor and an AMD 48-core processor system. We also compare our
framework with OpenMP tasks in a parallel pointer chasing algorithm running on
the Tilera processor. Our results show that the GPRM programs outperform the corresponding OpenMP codes on all test platforms, and that the framework greatly facilitates the writing of parallel programs, in particular non-data-parallel algorithms such as reductions.

Comment: In Proceedings PLACES 2013, arXiv:1312.221
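As a rough illustration of the task/communication split described above, the sketch below shows what C++ "task code" for the merge sort example might look like. The class and method names are hypothetical, not the actual GPRM API, and the functional composition is indicated only in a comment, since the paper presents GPIR by example.

    // Hypothetical sketch of GPRM-style "task code"; class and method
    // names are illustrative, not the actual GPRM API.
    #include <algorithm>
    #include <iterator>
    #include <vector>

    class MergeSortTask {
    public:
        // Leaf computation: sort one chunk. Pure (no shared state),
        // so independent calls can be evaluated in parallel.
        std::vector<int> sort(std::vector<int> chunk) const {
            std::sort(chunk.begin(), chunk.end());
            return chunk;
        }
        // Combine two sorted chunks into one sorted sequence.
        std::vector<int> merge(const std::vector<int>& a,
                               const std::vector<int>& b) const {
            std::vector<int> out;
            out.reserve(a.size() + b.size());
            std::merge(a.begin(), a.end(), b.begin(), b.end(),
                       std::back_inserter(out));
            return out;
        }
    };
    // The "communication code" would compose these tasks functionally,
    // e.g. (hypothetically) merge(sort(left), sort(right)), with the
    // two sort applications reduced in parallel by the GPRM.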
FooPar: A Functional Object Oriented Parallel Framework in Scala
We present FooPar, an extension for highly efficient parallel computing in the multi-paradigm programming language Scala. Scala offers concise and clean syntax and integrates functional programming features. Our framework FooPar combines these features with parallel computing techniques. FooPar has a modular design and provides easy access to different communication backends for distributed-memory architectures as well as to high-performance math libraries. In this article we use it to parallelize matrix-matrix multiplication and show its scalability through an isoefficiency analysis. In addition, we give results based on an empirical analysis on two supercomputers, achieving close-to-optimal performance with respect to the theoretical peak. Based on these results we conclude that FooPar gives full access to Scala's design features without suffering performance drops compared to implementations purely based on C and MPI.
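To make the notion of an isoefficiency analysis concrete, here is the standard textbook derivation for a 2D (Cannon-style) parallel matrix-matrix multiplication on a sqrt(p) x sqrt(p) process grid. This is an illustrative sketch, not necessarily the exact algorithm or constants analyzed in the FooPar article.

    % Standard isoefficiency sketch (LaTeX); illustrative only.
    % E: efficiency, T_o: total overhead, W = n^3: work.
    \begin{align*}
      E &= \frac{T_1}{p\,T_p}, \qquad T_o(W,p) = p\,T_p - T_1,\\
      W &= \Theta(n^3), \qquad
      T_o = \Theta\!\left(\sqrt{p}\,n^2\right)
        \quad \text{(communication on a $\sqrt{p}\times\sqrt{p}$ grid)},\\
      \text{constant } E &\;\Rightarrow\; W = \Theta(T_o)
        \;\Rightarrow\; n^3 = \Theta\!\left(\sqrt{p}\,n^2\right)
        \;\Rightarrow\; W = \Theta\!\left(p^{3/2}\right).
    \end{align*}

That is, the problem size must grow as p^{3/2} to maintain constant efficiency as processors are added.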
A Massive Data Parallel Computational Framework for Petascale/Exascale Hybrid Computer Systems
Heterogeneous architectures are becoming more common in High Performance Computing (HPC) systems. Even with tools like CUDA and OpenCL, it is a non-trivial task to obtain optimal performance on the GPU. Approaches to simplifying this task include Merge (a library-based framework for heterogeneous multi-core systems), Zippy (a framework for parallel execution of codes on multiple GPUs), BSGP (a new programming language for general-purpose computation on the GPU), and CUDA-lite (an enhancement to CUDA that transforms code based on annotations).
In addition, efforts are underway to improve compiler tools for automatic
parallelization and optimization of affine loop nests for GPUs and for
automatic translation of OpenMP parallelized codes to CUDA.
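For readers unfamiliar with the term, the snippet below is a minimal C++ example of an affine loop nest (a 2D Jacobi-style sweep), the kind of code such compiler tools can analyze and map to a GPU. It is illustrative only and unrelated to any specific tool mentioned above.

    // Illustrative affine loop nest: all array subscripts are affine
    // functions of the loop counters i and j, so the iteration space
    // and data accesses can be analyzed exactly by polyhedral tools.
    constexpr int N = 1024;

    void jacobi_sweep(const float* in, float* out) {
        for (int i = 1; i < N - 1; ++i)
            for (int j = 1; j < N - 1; ++j)
                out[i * N + j] = 0.25f * (in[(i - 1) * N + j] +
                                          in[(i + 1) * N + j] +
                                          in[i * N + j - 1] +
                                          in[i * N + j + 1]);
    }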
In this paper we present an alternative approach: a new computational framework, built upon the highly scalable Cactus framework, for the development of massively data-parallel scientific applications suitable for use on such petascale/exascale hybrid systems. As the first non-trivial demonstration of its usefulness, we successfully developed a new 3D CFD code that achieves improved performance.

Comment: Parallel Computing 2011 (ParCo2011), 30 August – 2 September 2011, Ghent, Belgium
Investigation of the applicability of a functional programming model to fault-tolerant parallel processing for knowledge-based systems
In a fault-tolerant parallel computer, a functional programming model can facilitate distributed checkpointing, error recovery, load balancing, and graceful degradation. Such a model has been implemented on the Draper Fault-Tolerant Parallel Processor (FTPP). When used in conjunction with the FTPP's fault detection and masking capabilities, this implementation results in a graceful degradation of system performance after faults. Three graceful degradation algorithms have been implemented and are presented. A user interface has been implemented which requires minimal cognitive overhead from the application programmer, masking such complexities as the system's redundancy, distributed nature, variable complement of processing resources, load balancing, and fault occurrence and recovery. This user interface is described and its use demonstrated. The applicability of the functional programming style to the Activation Framework, a paradigm for intelligent systems, is then briefly described.
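A toy C++ sketch of why a functional model helps here: a task without side effects can simply be re-executed elsewhere after a fault, which is the property the checkpointing and recovery machinery relies on. This illustrates the general principle only; it is not FTPP code.

    // Toy illustration (not FTPP code): a pure task maps the same
    // input to the same output and touches no shared state, so
    // re-running it after a detected fault is always safe.
    #include <functional>
    #include <stdexcept>

    template <typename In, typename Out>
    Out run_with_recovery(const std::function<Out(const In&)>& pure_task,
                          const In& input, int max_attempts) {
        for (int attempt = 1; ; ++attempt) {
            try {
                return pure_task(input);   // idempotent: safe to repeat
            } catch (const std::exception&) {
                if (attempt == max_attempts) throw;
                // ...reschedule on a surviving processor and retry...
            }
        }
    }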
FastFlow tutorial
FastFlow is a structured parallel programming framework targeting shared-memory multicores. Its layered design, together with the optimized implementation of the communication mechanisms underlying the streaming networks offered to the application programmer as algorithmic skeletons, supports the development of efficient fine-grained parallel applications. FastFlow is available (open source) at SourceForge (http://sourceforge.net/projects/mc-fastflow/). This work introduces FastFlow programming techniques and points out the different ways to parallelize existing C/C++ code using FastFlow as a software accelerator. In short: this is a kind of tutorial on FastFlow.

Comment: 49 pages + cover
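In the spirit of the tutorial, here is a minimal three-stage FastFlow pipeline. It follows the classic ff_node/ff_pipeline API (svc returning NULL to signal end-of-stream, GO_ON to consume without emitting); details may vary between FastFlow versions.

    // Minimal FastFlow pipeline sketch: emitter -> worker -> collector.
    // API usage follows the classic tutorial style; exact headers and
    // conventions may differ across FastFlow versions.
    #include <iostream>
    #include <ff/pipeline.hpp>
    using namespace ff;

    struct Emitter : ff_node {
        long n = 0;
        void* svc(void*) {
            if (n >= 10) return NULL;      // NULL signals end-of-stream
            return new long(n++);          // emit one task per call
        }
    };
    struct Square : ff_node {
        void* svc(void* t) {
            long* v = static_cast<long*>(t);
            *v *= *v;
            return v;                      // forward to the next stage
        }
    };
    struct Printer : ff_node {
        void* svc(void* t) {
            long* v = static_cast<long*>(t);
            std::cout << *v << '\n';
            delete v;
            return GO_ON;                  // consume, emit nothing
        }
    };

    int main() {
        ff_pipeline pipe;
        pipe.add_stage(new Emitter);
        pipe.add_stage(new Square);
        pipe.add_stage(new Printer);
        return pipe.run_and_wait_end() < 0 ? 1 : 0;
    }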
Quasar: A Programming Framework for Rapid Prototyping
We present a new programming framework, Quasar, which facilitates GPU programming. Our high-level programming language relieves developers of all implementation details so that they can focus on the algorithm and the required accuracy. The proposed programming framework consists of a simple high-level programming language, an advanced compiler system, a runtime system, and an IDE.
The Quasar language is a high-level scripting language with an easy-to-learn syntax similar to Python and MATLAB. This makes Quasar well suited for fast development and prototyping. A Quasar program is first processed by a front-end compiler that automatically detects serial and parallel loops that could be accelerated by heterogeneous hardware. In a second compilation phase, a number of back-end compilers process the output of the front-end compiler, generating C++, OpenCL, or CUDA code. Based on the generated code, the runtime system can dynamically switch between CPU and GPU. This automatic scheduling at runtime is done by analyzing the load of all devices, the memory transfer cost, and the complexity of the task. This way, the programmer is relieved of complicated implementation issues that are common when programming heterogeneous hardware.
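The runtime scheduling decision can be pictured with a toy cost model like the C++ sketch below, which weighs device load, transfer cost, and task complexity as the abstract describes. All constants and the formula are invented for illustration; this is not Quasar's actual scheduler.

    // Toy device-selection cost model (hypothetical, not Quasar's):
    // pick the device with the lower estimated completion time.
    enum class Device { CPU, GPU };

    Device choose_device(double cpu_load, double gpu_load,
                         double bytes_to_move, double flops) {
        const double bus_bw   = 12e9;   // assumed host<->GPU bandwidth, B/s
        const double cpu_rate = 5e10;   // assumed sustained CPU flop/s
        const double gpu_rate = 5e11;   // assumed sustained GPU flop/s
        double t_cpu = (flops / cpu_rate) * (1.0 + cpu_load);
        double t_gpu = bytes_to_move / bus_bw
                     + (flops / gpu_rate) * (1.0 + gpu_load);
        return (t_gpu < t_cpu) ? Device::GPU : Device::CPU;
    }

Under such a model, small or transfer-heavy tasks stay on the CPU while large compute-bound tasks move to the GPU.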
We validated the use of Quasar on a number of complex image processing and computer vision algorithms, ranging from denoising to automated image segmentation. Using Quasar we obtain speed-up factors of 4 to over 60, depending on the application. All results were achieved using an NVIDIA GeForce M750
