Search CORE

27,793 research outputs found

ARTICLE NO. PC971367 A Library-Based Approach to Task Parallelism in a Data-Parallel Language

Author: Alok Choudhary
David R. Kohr
Ian Foster
Rakesh Krishnaiyer
Publication venue
Publication date
Field of study

Pure data-parallel languages such as High Performance Fortran version 1 (HPF) do not allow efficient expression of mixed task/data-parallel computations or the coupling of separately compiled data-parallel modules. In this paper, we show how these common parallel program structures can be represented, with only minor extensions to the HPF model, by using a coordination library based on the Message Passing Interface (MPI). This library allows data-parallel tasks to exchange distributed data structures using calls to simple communication functions. We present microbenchmark results that characterize the performance of this library and that quantify the impact of optimizations that allow reuse of communication schedules in common situations. In addition, results from two-dimensional FFT, convolution, and multiblock programs demonstrate that the HPF/ MPI library can provide performance superior to that of pure HPF. We conclude that this synergistic combination of two parallel programming standards represents a useful approach to task parallelism in a data-parallel framework, increasing the range of problems addressable in HPF without requiring complex compile

CiteSeerX

Supporting high-level, high-performance parallel programming with library-driven optimization

Author: Rodrigues Christopher
Publication venue
Publication date: 01/05/2014
Field of study

Parallel programming is a demanding task for developers partly because achieving scalable parallel speedup requires drawing upon a repertoire of complex, algorithm-specific, architecture-aware programming techniques. Ideally, developers of programming tools would be able to build algorithm-specific, high-level programming interfaces that hide the complex architecture-aware details. However, it is a monumental undertaking to develop such tools from scratch, and it is challenging to provide reusable functionality for developing such tools without sacrificing the hosted interface’s performance or ease of use. In particular, to get high performance on a cluster of multicore computers without requiring developers to manually place data and computation onto processors, it is necessary to combine prior methods for shared memory parallelism with new methods for algorithm-aware distribution of computation and data across the cluster. This dissertation presents Triolet, a programming language and compiler for high-level programming of parallel loops for high-performance execution on clusters of multicore computers. Triolet adopts a simple, familiar programming interface based on traversing collections of data. By incorporating semantic knowledge of how traversals behave, Triolet achieves efficient parallel execution and communication. Moreover, Triolet’s performance on sequential loops is comparable to that of low-level C code, ranging from seven percent slower to 2.8× slower on tested benchmarks. Triolet’s design demonstrates that it is possible to decouple the design of a compiler from the implementation of parallelism without sacrificing performance or ease of use: parallel and sequential loops are implemented as library code and compiled to efficient code by an optimizing compiler that is unaware of parallelism beyond the scope of a single thread. All handling of parallel work partitioning, data partitioning, and scheduling is embodied in library code. During compilation, library code is inlined into a program and specialized to yield customized parallel loops. Experimental results from a 128-core cluster (with 8 nodes and 16 cores per node) show that loops in Triolet outperform loops in Eden, a similar high-level language. Triolet achieves significant parallel speedup over sequential C code, with performance ranging from slightly faster to 4.3× slower than manually parallelized C code on compute-intensive loops. Thus, Triolet demonstrates that a library of container traversal functions can deliver cluster-parallel performance comparable to manually parallelized C code without requiring programmers to manage parallelism. This programming approach opens the potential for future research into parallel programming frameworks

Illinois Digital Environment for Access to Learning and Scholarship Repository

Towards an Adaptive Skeleton Framework for Performance Portability

Author: Maier Patrick
Morton John Magnus
Trinder Phil
Publication venue: School of Computing Science, University of Glasgow
Publication date: 21/12/2015
Field of study

The proliferation of widely available, but very different, parallel architectures makes the ability to deliver good parallel performance on a range of architectures, or performance portability, highly desirable. Irregularly-parallel problems, where the number and size of tasks is unpredictable, are particularly challenging and require dynamic coordination. The paper outlines a novel approach to delivering portable parallel performance for irregularly parallel programs. The approach combines declarative parallelism with JIT technology, dynamic scheduling, and dynamic transformation. We present the design of an adaptive skeleton library, with a task graph implementation, JIT trace costing, and adaptive transformations. We outline the architecture of the protoype adaptive skeleton execution framework in Pycket, describing tasks, serialisation, and the current scheduler.We report a preliminary evaluation of the prototype framework using 4 micro-benchmarks and a small case study on two NUMA servers (24 and 96 cores) and a small cluster (17 hosts, 272 cores). Key results include Pycket delivering good sequential performance e.g. almost as fast as C for some benchmarks; good absolute speedups on all architectures (up to 120 on 128 cores for sumEuler); and that the adaptive transformations do improve performance

Enlighten

The Glasgow Parallel Reduction Machine: Programming Shared-memory Many-core Systems using Parallel Task Composition

Author: Tousimojarad Ashkan
Vanderbauwhede Wim
Publication venue: 'Open Publishing Association'
Publication date: 01/12/2013
Field of study

We present the Glasgow Parallel Reduction Machine (GPRM), a novel, flexible framework for parallel task-composition based many-core programming. We allow the programmer to structure programs into task code, written as C++ classes, and communication code, written in a restricted subset of C++ with functional semantics and parallel evaluation. In this paper we discuss the GPRM, the virtual machine framework that enables the parallel task composition approach. We focus the discussion on GPIR, the functional language used as the intermediate representation of the bytecode running on the GPRM. Using examples in this language we show the flexibility and power of our task composition framework. We demonstrate the potential using an implementation of a merge sort algorithm on a 64-core Tilera processor, as well as on a conventional Intel quad-core processor and an AMD 48-core processor system. We also compare our framework with OpenMP tasks in a parallel pointer chasing algorithm running on the Tilera processor. Our results show that the GPRM programs outperform the corresponding OpenMP codes on all test platforms, and can greatly facilitate writing of parallel programs, in particular non-data parallel algorithms such as reductions.Comment: In Proceedings PLACES 2013, arXiv:1312.221

arXiv.org e-Print Archive

Directory of Open Access Journals

Enlighten

A Review of Lightweight Thread Approaches for High Performance Computing

Author: Balaji Pavan
Castelló Adrián
Mayo Rafael
Peña Antonio J.
Quintana-Ortí Enrique S.
Seo Sangmin
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 08/12/2016
Field of study

High-level, directive-based solutions are becoming the programming models (PMs) of the multi/many-core architectures. Several solutions relying on operating system (OS) threads perfectly work with a moderate number of cores. However, exascale systems will spawn hundreds of thousands of threads in order to exploit their massive parallel architectures and thus conventional OS threads are too heavy for that purpose. Several lightweight thread (LWT) libraries have recently appeared offering lighter mechanisms to tackle massive concurrency. In order to examine the suitability of LWTs in high-level runtimes, we develop a set of microbenchmarks consisting of commonly-found patterns in current parallel codes. Moreover, we study the semantics offered by some LWT libraries in order to expose the similarities between different LWT application programming interfaces. This study reveals that a reduced set of LWT functions can be sufficient to cover the common parallel code patterns andthat those LWT libraries perform better than OS threads-based solutions in cases where task and nested parallelism are becoming more popular with new architectures.The researchers from the Universitat Jaume I de Castelló were supported by project TIN2014-53495-R of the MINECO, the Generalitat Valenciana fellowship programme Vali+d 2015, and FEDER. This work was partially supported by the U.S. Dept. of Energy, Office of Science, Office of Advanced Scientific Computing Research (SC-21), under contract DEAC02-06CH11357. We gratefully acknowledge the computing resources provided and operated by the Joint Laboratory for System Evaluation (JLSE) at Argonne National Laboratory.Peer ReviewedPostprint (author's final draft

Crossref

UPCommons. Portal del coneixement obert de la UPC