180 research outputs found
Toward optimised skeletons for heterogeneous parallel architecture with performance cost model
High performance architectures are increasingly heterogeneous with shared and
distributed memory components, and accelerators like GPUs. Programming such
architectures is complicated and performance portability is a major issue as the
architectures evolve. This thesis explores the potential for algorithmic skeletons
integrating a dynamically parametrised static cost model, to deliver portable
performance for mostly regular data parallel programs on heterogeneous
architectures.
The first contribution of this thesis is to address the challenges of
programming heterogeneous architectures by providing two skeleton-based
programming libraries: HWSkel for heterogeneous multicore clusters, and GPU-HWSkel,
which enables GPUs to be exploited as general purpose multi-processor devices.
Both libraries provide heterogeneous data parallel algorithmic skeletons including
hMap, hMapAll, hReduce, hMapReduce, and hMapReduceAll.
The second contribution is the development of cost models for workload
distribution. First, we construct an architectural cost model (CM1) to optimise
overall processing time for HWSkel heterogeneous skeletons on a heterogeneous
system composed of networks of arbitrary numbers of nodes, each with an
arbitrary number of cores sharing arbitrary amounts of memory. The cost model
characterises the components of the architecture by the number of cores, clock
speed, and crucially the size of the L2 cache. Second, we extend the HWSkel cost
model (CM1) to account for GPU performance. The extended cost model (CM2)
is used in the GPU-HWSkel library to automatically find a good distribution
for both a single heterogeneous multicore/GPU node, and clusters of
heterogeneous multicore/GPU nodes. Experiments are carried out on three heterogeneous
multicore clusters, four heterogeneous multicore/GPU clusters, and three single
heterogeneous multicore/GPU nodes. The results of experimental evaluations for
four data parallel benchmarks, i.e. sumEuler, Image matching, Fibonacci, and
Matrix Multiplication, show that our combined heterogeneous skeletons and cost
models can make good use of resources in heterogeneous systems. Moreover, using
cores together with a GPU in the same host can deliver good performance either
on a single node or on multiple node architectures.
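As a rough illustration of the cost-model-driven workload distribution described in the abstract above, the sketch below splits a number of work items across nodes in proportion to a per-node weight built from core count, clock speed, and L2 cache size. The weighting formula, the `cache_factor` parameter, and all names are assumptions made for illustration; the abstract does not give the actual CM1 or CM2 formulas.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    cores: int
    clock_ghz: float
    l2_cache_mb: float


def node_weight(node, cache_factor=0.1):
    # Hypothetical weighting, NOT the thesis's CM1 formula:
    # relative capacity grows with cores, clock speed, and L2 cache size.
    return node.cores * node.clock_ghz * (1.0 + cache_factor * node.l2_cache_mb)


def distribute(total_items, nodes, cache_factor=0.1):
    """Split total_items across nodes in proportion to their weights."""
    weights = [node_weight(n, cache_factor) for n in nodes]
    total_weight = sum(weights)
    # Start from the rounded-down proportional share of each node.
    shares = [int(total_items * w / total_weight) for w in weights]
    # Hand the leftover items to the highest-weight nodes, one each.
    remainder = total_items - sum(shares)
    order = sorted(range(len(nodes)), key=lambda i: weights[i], reverse=True)
    for i in order[:remainder]:
        shares[i] += 1
    return {n.name: s for n, s in zip(nodes, shares)}
```

For example, `distribute(12, [Node("a", 4, 2.0, 0.0), Node("b", 2, 2.0, 0.0)])` gives node `a` twice the share of node `b`, since it has twice the cores at the same clock speed.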
Optimizing for a Many-Core Architecture without Compromising Ease-of-Programming
Faced with nearly stagnant clock speed advances, chip manufacturers have turned to parallelism as the source for continuing performance improvements. But even though numerous parallel architectures have already been brought to market, a universally accepted methodology for programming them for general purpose applications has yet to emerge. Existing solutions tend to be hardware-specific, rendering them difficult to use for the majority of application programmers and domain experts, and not providing scalability guarantees for future generations of the hardware.
This dissertation advances the validation of the following thesis: it is possible to develop efficient general-purpose programs for a many-core platform using a model recognized for its simplicity. To prove this thesis, we refer to the eXplicit Multi-Threading (XMT) architecture designed and built at the University of Maryland. XMT is an attempt at re-inventing parallel computing with a solid theoretical foundation and an aggressive scalable design. Algorithmically, XMT is inspired by the PRAM (Parallel Random Access Machine) model and the architecture design is focused on reducing inter-task communication and synchronization overheads and providing an easy-to-program parallel model.
This thesis builds upon the existing XMT infrastructure to improve support for efficient execution with a focus on ease-of-programming. Our contributions aim at reducing the programmer's effort in developing XMT applications and improving the overall performance. More concretely, we: (1) present a workflow guiding programmers to produce efficient parallel solutions starting from a high-level problem; (2) introduce an analytical performance model for XMT programs and provide a methodology to project running time from an implementation; (3) propose and evaluate RAP, an improved resource-aware compiler loop prefetching algorithm targeted at fine-grained many-core architectures; we demonstrate performance improvements of up to 34.79% on average over the GCC loop prefetching implementation and up to 24.61% on average over a simple hardware prefetching scheme; and (4) implement a number of parallel benchmarks and evaluate the overall performance of XMT relative to existing serial and parallel solutions, showing speedups of up to 13.89x vs. a serial processor and 8.10x vs. parallel code optimized for an existing many-core (GPU). We also discuss the implementation and optimization of the Max-Flow algorithm on XMT, a problem which is among the more advanced in terms of complexity, benchmarking and research interest in the parallel algorithms community. We demonstrate better speed-ups compared to a best serial solution than previous attempts on other parallel platforms.
Exploring the Design Space of Static and Incremental Graph Connectivity Algorithms on GPUs
Connected components and spanning forest are fundamental graph algorithms due
to their use in many important applications, such as graph clustering and image
segmentation. GPUs are an ideal platform for graph algorithms due to their high
peak performance and memory bandwidth. While there exist several GPU
connectivity algorithms in the literature, many design choices have not yet
been explored. In this paper, we explore various design choices in GPU
connectivity algorithms, including sampling, linking, and tree compression, for
both the static as well as the incremental setting. Our various design choices
lead to over 300 new GPU implementations of connectivity, many of which
outperform the state of the art. We present an experimental evaluation and show
that we achieve an average speedup of 2.47x over existing static
algorithms. In the incremental setting, we achieve a throughput of up to 48.23
billion edges per second. Compared to state-of-the-art CPU implementations on a
72-core machine, we achieve a speedup of 8.26–14.51x for static connectivity
and 1.85–13.36x for incremental connectivity using a Tesla V100 GPU.
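To make the "linking" and "tree compression" design choices named in the abstract above concrete, here is a minimal sequential union-find sketch. It is a CPU illustration only, not one of the paper's GPU implementations; the sampling step is omitted, and the particular choices shown (link the larger root under the smaller, compress by path halving) are assumptions picked for simplicity.

```python
def find(parent, x):
    # Tree compression by path halving: each visited vertex is
    # re-pointed to its grandparent while walking to the root.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def link(parent, u, v):
    # Linking: attach the root with the larger id under the smaller one.
    # Returns True if the edge merged two components.
    ru, rv = find(parent, u), find(parent, v)
    if ru == rv:
        return False
    if ru < rv:
        parent[rv] = ru
    else:
        parent[ru] = rv
    return True


def connected_components(n, edges):
    """Static setting: label each of n vertices with its component root."""
    parent = list(range(n))
    for u, v in edges:
        link(parent, u, v)
    return [find(parent, x) for x in range(n)]
```

In the incremental setting, the same `parent` array can be kept alive between batches: each newly inserted edge is passed to `link`, and connectivity queries compare `find(parent, u)` with `find(parent, v)`.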
An occam Style Communications System for UNIX Networks
This document describes the design of a communications system which provides occam style communications primitives under a Unix environment, using TCP/IP protocols, and any number of other protocols deemed suitable as underlying transport layers. The system will integrate with a low overhead scheduler/kernel without incurring significant costs to the execution of processes within the run time environment. A survey of relevant occam and occam3 features and related research is followed by a look at the Unix and TCP/IP facilities which determine our working constraints, and a description of the T9000 transputer's Virtual Channel Processor, which was instrumental in our formulation. Drawing from the information presented here, a design for the communications system is subsequently proposed. Finally, a preliminary investigation is made of methods for lightweight access control to shared resources in an environment which does not provide support for critical sections, semaphores, or busy waiting. This is presented with relevance to mutual exclusion problems which arise within the proposed design. Future directions for the evolution of this project are discussed in conclusion.