1,211 research outputs found
Towards an Adaptive Skeleton Framework for Performance Portability
The proliferation of widely available, but very different, parallel architectures
makes the ability to deliver good parallel performance
on a range of architectures, or performance portability, highly desirable.
Irregularly-parallel problems, where the number and size
of tasks is unpredictable, are particularly challenging and require
dynamic coordination.
The paper outlines a novel approach to delivering portable parallel
performance for irregularly parallel programs. The approach
combines declarative parallelism with JIT technology, dynamic
scheduling, and dynamic transformation.
We present the design of an adaptive skeleton library, with a task
graph implementation, JIT trace costing, and adaptive transformations.
We outline the architecture of the protoype adaptive skeleton
execution framework in Pycket, describing tasks, serialisation,
and the current scheduler.We report a preliminary evaluation of the
prototype framework using 4 micro-benchmarks and a small case
study on two NUMA servers (24 and 96 cores) and a small cluster
(17 hosts, 272 cores). Key results include Pycket delivering good
sequential performance e.g. almost as fast as C for some benchmarks;
good absolute speedups on all architectures (up to 120 on
128 cores for sumEuler); and that the adaptive transformations do
improve performance
Type-driven automated program transformations and cost modelling for optimising streaming programs on FPGAs
In this paper we present a novel approach to program optimisation based on compiler-based type-driven program transformations and a fast and accurate cost/performance model for the target architecture. We target streaming programs for the problem domain of scientific computing, such as numerical weather prediction. We present our theoretical framework for type-driven program transformation, our target high-level language and intermediate representation languages and the cost model and demonstrate the effectiveness of our approach by comparison with a commercial toolchain
Contract-Based General-Purpose GPU Programming
Using GPUs as general-purpose processors has revolutionized parallel
computing by offering, for a large and growing set of algorithms, massive
data-parallelization on desktop machines. An obstacle to widespread adoption,
however, is the difficulty of programming them and the low-level control of the
hardware required to achieve good performance. This paper suggests a
programming library, SafeGPU, that aims at striking a balance between
programmer productivity and performance, by making GPU data-parallel operations
accessible from within a classical object-oriented programming language. The
solution is integrated with the design-by-contract approach, which increases
confidence in functional program correctness by embedding executable program
specifications into the program text. We show that our library leads to modular
and maintainable code that is accessible to GPGPU non-experts, while providing
performance that is comparable with hand-written CUDA code. Furthermore,
runtime contract checking turns out to be feasible, as the contracts can be
executed on the GPU
A compiler approach to scalable concurrent program design
The programmer's most powerful tool for controlling complexity in program design is abstraction. We seek to use abstraction in the design of concurrent programs, so as to
separate design decisions concerned with decomposition, communication, synchronization, mapping, granularity, and load balancing. This paper describes programming and compiler techniques intended to facilitate this design strategy. The programming techniques are based on a core programming notation with two important properties: the ability to separate concurrent programming concerns, and extensibility with reusable programmer-defined
abstractions. The compiler techniques are based on a simple transformation system together with a set of compilation transformations and portable run-time support. The
transformation system allows programmer-defined abstractions to be defined as source-to-source transformations that convert abstractions into the core notation. The same
transformation system is used to apply compilation transformations that incrementally transform the core notation toward an abstract concurrent machine. This machine can be implemented on a variety of concurrent architectures using simple run-time support.
The transformation, compilation, and run-time system techniques have been implemented and are incorporated in a public-domain program development toolkit. This
toolkit operates on a wide variety of networked workstations, multicomputers, and shared-memory
multiprocessors. It includes a program transformer, concurrent compiler, syntax checker, debugger, performance analyzer, and execution animator. A variety of substantial
applications have been developed using the toolkit, in areas such as climate modeling and fluid dynamics
A Fast and Accurate Cost Model for FPGA Design Space Exploration in HPC Applications
Heterogeneous High-Performance Computing
(HPC) platforms present a significant programming challenge,
especially because the key users of HPC resources are scientists,
not parallel programmers. We contend that compiler technology
has to evolve to automatically create the best program variant
by transforming a given original program. We have developed a
novel methodology based on type transformations for generating
correct-by-construction design variants, and an associated
light-weight cost model for evaluating these variants for
implementation on FPGAs. In this paper we present a key
enabler of our approach, the cost model. We discuss how we
are able to quickly derive accurate estimates of performance
and resource-utilization from the design’s representation in our
intermediate language. We show results confirming the accuracy
of our cost model by testing it on three different scientific
kernels. We conclude with a case-study that compares a solution
generated by our framework with one from a conventional
high-level synthesis tool, showing better performance and
power-efficiency using our cost model based approach
RIPL: An Efficient Image Processing DSL for FPGAs
Field programmable gate arrays (FPGAs) can accelerate image processing by
exploiting fine-grained parallelism opportunities in image operations. FPGA
language designs are often subsets or extensions of existing languages, though
these typically lack suitable hardware computation models so compiling them to
FPGAs leads to inefficient designs. Moreover, these languages lack image
processing domain specificity. Our solution is RIPL, an image processing domain
specific language (DSL) for FPGAs. It has algorithmic skeletons to express
image processing, and these are exploited to generate deep pipelines of highly
concurrent and memory-efficient image processing components.Comment: Presented at Second International Workshop on FPGAs for Software
Programmers (FSP 2015) (arXiv:1508.06320
木を用いた構造化並列プログラミング
High-level abstractions for parallel programming are still immature. Computations on complicated data structures such as pointer structures are considered as irregular algorithms. General graph structures, which irregular algorithms generally deal with, are difficult to divide and conquer. Because the divide-and-conquer paradigm is essential for load balancing in parallel algorithms and a key to parallel programming, general graphs are reasonably difficult. However, trees lead to divide-and-conquer computations by definition and are sufficiently general and powerful as a tool of programming. We therefore deal with abstractions of tree-based computations. Our study has started from Matsuzaki’s work on tree skeletons. We have improved the usability of tree skeletons by enriching their implementation aspect. Specifically, we have dealt with two issues. We first have implemented the loose coupling between skeletons and data structures and developed a flexible tree skeleton library. We secondly have implemented a parallelizer that transforms sequential recursive functions in C into parallel programs that use tree skeletons implicitly. This parallelizer hides the complicated API of tree skeletons and makes programmers to use tree skeletons with no burden. Unfortunately, the practicality of tree skeletons, however, has not been improved. On the basis of the observations from the practice of tree skeletons, we deal with two application domains: program analysis and neighborhood computation. In the domain of program analysis, compilers treat input programs as control-flow graphs (CFGs) and perform analysis on CFGs. Program analysis is therefore difficult to divide and conquer. To resolve this problem, we have developed divide-and-conquer methods for program analysis in a syntax-directed manner on the basis of Rosen’s high-level approach. Specifically, we have dealt with data-flow analysis based on Tarjan’s formalization and value-graph construction based on a functional formalization. In the domain of neighborhood computations, a primary issue is locality. A naive parallel neighborhood computation without locality enhancement causes a lot of cache misses. The divide-and-conquer paradigm is known to be useful also for locality enhancement. We therefore have applied algebraic formalizations and a tree-segmenting technique derived from tree skeletons to the locality enhancement of neighborhood computations.電気通信大学201
- …