4,427 research outputs found
The Glasgow Parallel Reduction Machine: Programming Shared-memory Many-core Systems using Parallel Task Composition
We present the Glasgow Parallel Reduction Machine (GPRM), a novel, flexible
framework for parallel task-composition based many-core programming. We allow
the programmer to structure programs into task code, written as C++ classes,
and communication code, written in a restricted subset of C++ with functional
semantics and parallel evaluation. In this paper we discuss the GPRM, the
virtual machine framework that enables the parallel task composition approach.
We focus the discussion on GPIR, the functional language used as the
intermediate representation of the bytecode running on the GPRM. Using examples
in this language we show the flexibility and power of our task composition
framework. We demonstrate the potential using an implementation of a merge sort
algorithm on a 64-core Tilera processor, as well as on a conventional Intel
quad-core processor and an AMD 48-core processor system. We also compare our
framework with OpenMP tasks in a parallel pointer chasing algorithm running on
the Tilera processor. Our results show that the GPRM programs outperform the
corresponding OpenMP codes on all test platforms, and can greatly facilitate
writing of parallel programs, in particular non-data parallel algorithms such
as reductions.Comment: In Proceedings PLACES 2013, arXiv:1312.221
CoreTSAR: Task Scheduling for Accelerator-aware Runtimes
Heterogeneous supercomputers that incorporate computational accelerators
such as GPUs are increasingly popular due to their high
peak performance, energy efficiency and comparatively low cost.
Unfortunately, the programming models and frameworks designed
to extract performance from all computational units still lack the
flexibility of their CPU-only counterparts. Accelerated OpenMP
improves this situation by supporting natural migration of OpenMP
code from CPUs to a GPU. However, these implementations currently
lose one of OpenMP’s best features, its flexibility: typical
OpenMP applications can run on any number of CPUs. GPU implementations
do not transparently employ multiple GPUs on a node
or a mix of GPUs and CPUs. To address these shortcomings, we
present CoreTSAR, our runtime library for dynamically scheduling
tasks across heterogeneous resources, and propose straightforward
extensions that incorporate this functionality into Accelerated
OpenMP. We show that our approach can provide nearly linear
speedup to four GPUs over only using CPUs or one GPU while
increasing the overall flexibility of Accelerated OpenMP
Parallel Sort-Based Matching for Data Distribution Management on Shared-Memory Multiprocessors
In this paper we consider the problem of identifying intersections between
two sets of d-dimensional axis-parallel rectangles. This is a common problem
that arises in many agent-based simulation studies, and is of central
importance in the context of High Level Architecture (HLA), where it is at the
core of the Data Distribution Management (DDM) service. Several realizations of
the DDM service have been proposed; however, many of them are either
inefficient or inherently sequential. These are serious limitations since
multicore processors are now ubiquitous, and DDM algorithms -- being
CPU-intensive -- could benefit from additional computing power. We propose a
parallel version of the Sort-Based Matching algorithm for shared-memory
multiprocessors. Sort-Based Matching is one of the most efficient serial
algorithms for the DDM problem, but is quite difficult to parallelize due to
data dependencies. We describe the algorithm and compute its asymptotic running
time; we complete the analysis by assessing its performance and scalability
through extensive experiments on two commodity multicore systems based on a
dual socket Intel Xeon processor, and a single socket Intel Core i7 processor.Comment: Proceedings of the 21-th ACM/IEEE International Symposium on
Distributed Simulation and Real Time Applications (DS-RT 2017). Best Paper
Award @DS-RT 201
A Comparison of some recent Task-based Parallel Programming Models
The need for parallel programming models that are simple to use and at the same time efficient for current ant future parallel platforms has led to recent attention to task-based models such as Cilk++, Intel TBB and the task concept in OpenMP version 3.0. The choice of model and implementation can have a major impact on the final performance and in order to understand some of the trade-offs we have made a quantitative study comparing four implementations of OpenMP (gcc, Intel icc, Sun studio and the research compiler Mercurium/nanos mcc), Cilk++ and Wool, a high-performance task-based library developed at SICS.
Abstract. We use microbenchmarks to characterize costs for task-creation and stealing and the Barcelona OpenMP Tasks Suite for characterizing application performance. By far Wool and Cilk++ have the lowest overhead in both spawning and stealing tasks. This is reflected in application performance when many tasks with small granularity are spawned where Cilk++ and, in particular, has the highest performance. For coarse granularity applications, the OpenMP implementations have quite similar performance as the more light-weight Cilk++ and Wool except for one application where mcc is superior thanks to a superior task scheduler.
Abstract. The OpenMP implemenations are generally not yet ready for use when the task granularity becomes very small. There is no inherent reason for this, so we expect future implementations of OpenMP to focus on this issue
- …