600 research outputs found
Towards an Adaptive Skeleton Framework for Performance Portability
The proliferation of widely available, but very different, parallel architectures
makes the ability to deliver good parallel performance
on a range of architectures, or performance portability, highly desirable.
Irregularly-parallel problems, where the number and size
of tasks is unpredictable, are particularly challenging and require
dynamic coordination.
The paper outlines a novel approach to delivering portable parallel
performance for irregularly parallel programs. The approach
combines declarative parallelism with JIT technology, dynamic
scheduling, and dynamic transformation.
We present the design of an adaptive skeleton library, with a task
graph implementation, JIT trace costing, and adaptive transformations.
We outline the architecture of the protoype adaptive skeleton
execution framework in Pycket, describing tasks, serialisation,
and the current scheduler.We report a preliminary evaluation of the
prototype framework using 4 micro-benchmarks and a small case
study on two NUMA servers (24 and 96 cores) and a small cluster
(17 hosts, 272 cores). Key results include Pycket delivering good
sequential performance e.g. almost as fast as C for some benchmarks;
good absolute speedups on all architectures (up to 120 on
128 cores for sumEuler); and that the adaptive transformations do
improve performance
Architecture independent environment for developing engineering software on MIMD computers
Engineers are constantly faced with solving problems of increasing complexity and detail. Multiple Instruction stream Multiple Data stream (MIMD) computers have been developed to overcome the performance limitations of serial computers. The hardware architectures of MIMD computers vary considerably and are much more sophisticated than serial computers. Developing large scale software for a variety of MIMD computers is difficult and expensive. There is a need to provide tools that facilitate programming these machines. First, the issues that must be considered to develop those tools are examined. The two main areas of concern were architecture independence and data management. Architecture independent software facilitates software portability and improves the longevity and utility of the software product. It provides some form of insurance for the investment of time and effort that goes into developing the software. The management of data is a crucial aspect of solving large engineering problems. It must be considered in light of the new hardware organizations that are available. Second, the functional design and implementation of a software environment that facilitates developing architecture independent software for large engineering applications are described. The topics of discussion include: a description of the model that supports the development of architecture independent software; identifying and exploiting concurrency within the application program; data coherence; engineering data base and memory management
Aspects of practical implementations of PRAM algorithms
The PRAM is a shared memory model of parallel computation which abstracts away from inessential engineering details. It provides a very simple architecture independent model and provides a good programming environment. Theoreticians of the computer science community have proved that it is possible to emulate the theoretical PRAM model using current technology. Solutions have been found for effectively interconnecting processing elements, for routing data on these networks and for distributing the data among memory modules without hotspots. This thesis reviews this emulation and the possibilities it provides for large scale general purpose parallel computation. The emulation employs a bridging model which acts as an interface between the actual hardware and the PRAM model. We review the evidence that such a scheme crn achieve scalable parallel performance and portable parallel software and that PRAM algorithms can be optimally implemented on such practical models. In the course of this review we presented the following new results:
1. Concerning parallel approximation algorithms, we describe an NC algorithm for finding an approximation to a minimum weight perfect matching in a complete weighted graph. The algorithm is conceptually very simple and it is also the first NC-approximation algorithm for the task with a sub-linear performance ratio.
2. Concerning graph embedding, we describe dense edge-disjoint embeddings of the complete binary tree with n leaves in the following n-node communication networks: the hypercube, the de Bruijn and shuffle-exchange networks and the 2-dimcnsional mesh. In the embeddings the maximum distance from a leaf to the root of the tree is asymptotically optimally short. The embeddings facilitate efficient implementation of many PRAM algorithms on networks employing these graphs as interconnection networks.
3. Concerning bulk synchronous algorithmics, we describe scalable transportable algorithms for the following three commonly required types of computation; balanced tree computations. Fast Fourier Transforms and matrix multiplications
Recommended from our members
Towards architecture-adaptable parallel programming
There is a software gap in parallel processing. The short lifespan and small installation base of parallel architectures have made it economically infeasible to develop platform-specific parallel programming environments that deliver performance and programmability. One obvious solution is to build architecture-independent programming environments. But the architecture independence usually comes at the expense of performance, since the most efficient parallel algorithm for solving a problem often depends on the target platform. Thus, unless a parallel programming system has the ability to adapt the algorithm to the architecture, it will not be effectively machine-independent.
This research develops a new methodology for architecture-adaptable parallel programming. The methodology is built on three key ideas: (1) the use of a database of parameterized algorithmic templates to represent computable functions; (2) frame-based representation of processing environments; and (3) the use of an analytical performance prediction tool for automatic algorithm design.
This methodology pursues a problem-oriented approach to parallel processing as opposed to the traditional algorithm-oriented approach. This enables the
development of software environments with a high level of abstraction. The users state the problem to be solved using a high-level notation; they are freed from the esoteric tasks of parallel algorithm design and implementation.
This methodology has been validated in the format of a prototype of a system capable of automatically generating an efficient parallel program when presented with a well-defined problem and the description of a target platform. The use of object technology has made the system easily extensible. The templates are designed using a parallel adaptation of the well-known divide-and-conquer paradigm.
The prototype system has been used to solve several numerical problems efficiently on a wide spectrum of architectures. The target platforms include multicomputers (Thinking Machines CM-5 and Meiko CS-2), networks of workstations (IBM RS/6000s connected by FDDI), multiprocessors (Sequent Symmetry, SGI Power Challenge, and Sun SPARCServer), and a hierarchical system consisting of a cluster of multiprocessors on Myrinet
Shape-based cost analysis of skeletal parallel programs
Institute for Computing Systems ArchitectureThis work presents an automatic cost-analysis system for an implicitly parallel skeletal
programming language.
Although deducing interesting dynamic characteristics of parallel programs (and in
particular, run time) is well known to be an intractable problem in the general case, it
can be alleviated by placing restrictions upon the programs which can be expressed.
By combining two research threads, the “skeletal” and “shapely” paradigms which
take this route, we produce a completely automated, computation and communication
sensitive cost analysis system. This builds on earlier work in the area by quantifying
communication as well as computation costs, with the former being derived for the
Bulk Synchronous Parallel (BSP) model.
We present details of our shapely skeletal language and its BSP implementation strategy
together with an account of the analysis mechanism by which program behaviour
information (such as shape and cost) is statically deduced. This information can be
used at compile-time to optimise a BSP implementation and to analyse computation
and communication costs. The analysis has been implemented in Haskell. We consider
different algorithms expressed in our language for some example problems and
illustrate each BSP implementation, contrasting the analysis of their efficiency by traditional,
intuitive methods with that achieved by our cost calculator. The accuracy of
cost predictions by our cost calculator against the run time of real parallel programs is
tested experimentally.
Previous shape-based cost analysis required all elements of a vector (our nestable bulk
data structure) to have the same shape. We partially relax this strict requirement on data
structure regularity by introducing new shape expressions in our analysis framework.
We demonstrate that this allows us to achieve the first automated analysis of a complete
derivation, the well known maximum segment sum algorithm of Skillicorn and Cai
Practical Parallel Algorithms for Personalized Communication and Integer Sorting
A fundamental challenge for parallel computing is to obtain
high-level, architecture independent, algorithms which efficiently
execute on general-purpose parallel machines. With the emergence of
message passing standards such as MPI, it has become easier to design
efficient and portable parallel algorithms by making use of these
communication primitives. While existing primitives allow an
assortment of collective communication routines, they do not handle an
important communication event when most or all processors have
non-uniformly sized personalized messages to exchange with each other.
We focus in this paper on the h-relation personalized communication
whose efficient implementation will allow high performance
implementations of a large class of algorithms. While most previous
h-relation algorithms use randomization, this paper presents a new
deterministic approach for h-relation personalized communication. As
an application, we present an efficient algorithm for stable integer
sorting.
The algorithms presented in this paper have been coded in
Split-C and run on a variety of platforms, including the Thinking
Machines CM-5, IBM SP-1 and SP-2, Cray Research T3D, Meiko Scientific
CS-2, and the Intel Paragon. Our experimental results are consistent
with the theoretical analysis and illustrate the scalability and
efficiency of our algorithms across different platforms. In fact, they
seem to outperform all similar algorithms known to the authors on
these platforms.
(Also cross-referenced as UMIACS-TR-95-101.
- …