A new parallelisation technique for heterogeneous CPUs
Parallelization has moved in recent years into mainstream compilers, and the demand
for parallelizing tools that can do a better job of automatic parallelization is higher than
ever. During the last decade considerable attention has been focused on developing programming
tools that support both explicit and implicit parallelism to keep up with the
power of the new multiple core technology. Yet success in developing automatic parallelising
compilers has been limited, mainly due to the complexity of the analytic process
required to exploit available parallelism and manage other parallelisation measures such
as data partitioning, alignment and synchronization.
This dissertation investigates developing a programming tool that automatically parallelises
large data structures on a heterogeneous architecture and whether a high-level programming
language compiler can use this tool to exploit implicit parallelism and make use
of the performance potential of the modern multicore technology. The work involved the
development of a fully automatic parallelisation tool, called VSM, that completely hides
the underlying details of general purpose heterogeneous architectures. The VSM implementation
provides direct and simple access for users to parallelise array operations on the
Cell’s accelerators without the need for any annotations or process directives. This work
also involved the extension of the Glasgow Vector Pascal compiler to work with the VSM
implementation as a single compiler system. The developed compiler system, which is called
VP-Cell, takes a single source code and parallelises array expressions automatically.
Several experiments were conducted using Vector Pascal benchmarks to show the validity
of the VSM approach. The VP-Cell system achieved a significant runtime performance
improvement on one accelerator compared with the master processor, and near-linear
speedups for code running on the Cell’s accelerators. Though VSM was mainly designed for
developing parallelising compilers, it also showed considerable performance when running
C code on the Cell’s accelerators.
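The data partitioning that the abstract names as one of the parallelisation measures can be illustrated with a minimal sketch. This is not VSM or VP-Cell code: the names `split_ranges` and `parallel_elementwise` are hypothetical, and CPU threads stand in for the Cell's accelerators, so the sketch shows only how an array operation might be divided into contiguous chunks, not any real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(length, parts):
    """Split range(length) into `parts` contiguous, near-equal (start, stop) pairs."""
    base, extra = divmod(length, parts)
    ranges, start = [], 0
    for i in range(parts):
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def parallel_elementwise(op, a, b, workers=4):
    """Apply op(a[i], b[i]) for every i, handing one contiguous chunk to each worker."""
    if len(a) != len(b):
        raise ValueError("arrays must have the same length")
    result = [None] * len(a)

    def run_chunk(bounds):
        start, stop = bounds
        # Chunks do not overlap, so workers never write to the same index.
        for i in range(start, stop):
            result[i] = op(a[i], b[i])

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Consuming the iterator waits for every chunk and re-raises worker errors.
        list(pool.map(run_chunk, split_ranges(len(a), workers)))
    return result


a = list(range(10))
b = [10 * v for v in a]
total = parallel_elementwise(lambda x, y: x + y, a, b)
```

The caller writes a single whole-array operation and never sees the chunk boundaries, which is the kind of detail the abstract says a fully automatic tool should hide.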
Lightweight Modular Staging and Embedded Compilers: Abstraction without Regret for High-Level High-Performance Programming
Programs expressed in a high-level programming language need to be translated to a low-level machine dialect for execution. This translation is usually accomplished by a compiler, which is able to translate any legal program to equivalent low-level code. But for individual source programs, automatic translation does not always deliver good results: Software engineering practice demands generalization and abstraction, whereas high performance demands specialization and concretization. These goals are at odds, and compilers can only rarely translate expressive high-level programs to modern hardware platforms in a way that makes best use of the available resources. Explicit program generation is a promising alternative to fully automatic translation. Instead of writing down the program and relying on a compiler for translation, developers write a program generator, which produces a specialized, efficient, low-level program as its output. However, developing high-quality program generators requires a very large effort that is often hard to amortize. In this thesis, we propose a hybrid design: Integrate compilers into programs so that programs can take control of the translation process, but rely on libraries of common compiler functionality for help. We present Lightweight Modular Staging (LMS), a generative programming approach that lowers the development effort significantly. LMS combines program generator logic with the generated code in a single program, using only types to distinguish the two stages of execution. Through extensive use of component technology, LMS makes a reusable and extensible compiler framework available at the library level, allowing programmers to tightly integrate domain-specific abstractions and optimizations into the generation process, with common generic optimizations provided by the framework.
Compared to previous work on program generation, a key aspect of our design is the use of staging not only as a front-end, but also as a way to implement internal compiler passes and optimizations, many of which can be combined into powerful joint simplification passes. LMS is well suited to develop embedded domain-specific languages (DSLs) and has been used to develop powerful performance-oriented DSLs for demanding domains such as machine learning, with code generation for heterogeneous platforms including GPUs. LMS has also been used to generate SQL for embedded database queries and JavaScript for web applications.
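The idea of using only types to distinguish the two stages of execution can be sketched in a few lines. This is a toy illustration in Python rather than LMS itself: the `Rep` class, `power`, and `generate_power` are hypothetical names, and the dynamic type check below stands in for the static typing the abstract refers to. Values of type `int` are computed while the generator runs; values wrapped in `Rep` become text in the generated program.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rep:
    """A staged value: the source text of an expression to be evaluated later."""
    code: str

    def __mul__(self, other):
        # Multiplying staged values emits code instead of computing a number.
        other_code = other.code if isinstance(other, Rep) else repr(other)
        return Rep(f"({self.code} * {other_code})")


def power(x, n: int):
    """Works on plain numbers (computed now) and on Rep values (code generated)."""
    if n == 0:
        return 1
    if n == 1:
        return x
    return x * power(x, n - 1)


def generate_power(n: int) -> str:
    """Run `power` on a staged argument and wrap the result in a function definition."""
    body = power(Rep("x"), n)
    expr = body.code if isinstance(body, Rep) else repr(body)
    return f"def power_{n}(x):\n    return {expr}\n"


source = generate_power(3)
namespace = {}
exec(source, namespace)  # compile the generated, specialised program
power_3 = namespace["power_3"]
```

Because the exponent is an ordinary integer, the recursion is unrolled while the generator runs, and the generated `power_3` contains only multiplications. The same `power` definition still works on plain numbers, which is the sense in which generator logic and generated code share a single program.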
GPUMap: A Transparently GPU-Accelerated Map Function
As GPGPU computing becomes more popular, it will be used to tackle a wider range of problems. However, due to the current state of GPGPU programming, programmers are typically required to be familiar with the architecture of the GPU in order to effectively program it. Fortunately, there are software packages that attempt to simplify GPGPU programming in higher-level languages such as Java and Python. However, these software packages do not attempt to abstract the GPU-acceleration process completely. Instead, they require programmers to be somewhat familiar with the traditional GPGPU programming model which involves some understanding of GPU threads and kernels. In addition, prior to using these software packages, programmers are required to transform the data they would like to operate on into arrays of primitive data. Typically, such software packages restrict the use of object-oriented programming when implementing the code to operate on this data. This thesis presents GPUMap, which is a proof-of-concept GPU-accelerated map function for Python. GPUMap aims to hide all the details of the GPU from the programmer, and allows the programmer to accelerate programs written in normal Python code that operate on arbitrarily nested objects using a majority of Python syntax. Using GPUMap, certain types of Python programs are able to be accelerated up to 100 times over normal Python code.
There are also software packages that provide simplified GPU acceleration to distributed computing frameworks such as MapReduce and Spark. Unfortunately, these packages do not provide a completely abstracted GPU programming experience, which conflicts with the purpose of the distributed computing frameworks: to abstract the underlying distributed system. This thesis also presents GPU-accelerated RDD (GPURDD), a type of Spark Resilient Distributed Dataset (RDD) that incorporates GPUMap into its map, filter, and foreach methods in order to allow Spark applications to make use of the abstracted GPU acceleration provided by GPUMap.
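The programming model described above, a map function that accepts ordinary Python code over nested objects and hides the execution backend, might look like the following from the caller's side. This is not GPUMap's API: `transparent_map`, `Vec`, `Particle`, and `step` are hypothetical names, and the stand-in backend uses CPU threads rather than a GPU, so the sketch illustrates only the call shape.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Vec:
    x: float
    y: float


@dataclass
class Particle:
    position: Vec
    velocity: Vec


def transparent_map(func, items, workers=4):
    """Hypothetical stand-in with the same call shape as map().

    The caller never sees how the work is scheduled. Here it runs on
    CPU threads; no GPU is involved.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, items))


def step(p):
    """Ordinary Python over a nested object: no kernels, no thread indices."""
    new_position = Vec(p.position.x + p.velocity.x, p.position.y + p.velocity.y)
    return Particle(new_position, p.velocity)


particles = [Particle(Vec(float(i), 0.0), Vec(1.0, 2.0)) for i in range(4)]
moved = transparent_map(step, particles)
```

The input stays a list of nested objects; the caller does not first flatten it into arrays of primitive data, which is the step the abstract says other packages require.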
Enhancing productivity and performance portability of OpenCL applications on heterogeneous systems using runtime optimizations
Initially driven by a strong need for increased computational performance in science and
engineering, heterogeneous systems have become ubiquitous and they are getting increasingly
complex. The single processor era has been replaced with multi-core processors,
which have quickly been surrounded by satellite devices aiming to increase the throughput
of the entire system. These auxiliary devices, such as Graphics Processing Units, Field Programmable
Gate Arrays or other specialized processors have very different architectures.
This puts an enormous strain on programming models and software developers to take full
advantage of the computing power at hand. Because of this diversity and the unachievable
flexibility and portability necessary to optimize for each target individually, heterogeneous
systems typically remain vastly under-utilized.
In this thesis, we explore two distinct ways to tackle this problem. Providing automated,
non-intrusive methods in the form of compiler tools and implementing efficient abstractions
to automatically tune parameters for a restricted domain are two complementary
approaches investigated to better utilize compute resources in heterogeneous systems.
First, we explore a fully automated compiler-based approach, where a runtime system
analyzes the computation flow of an OpenCL application and optimizes it across multiple
compute kernels. This method can be deployed on any existing application transparently
and replaces the significant software engineering effort spent tuning an application for a particular
system. We show that this technique achieves speedups of up to 3x over unoptimized
code and an average of 1.4x over manually optimized code for highly dynamic applications.
Second, a library-based approach is designed to provide a high-level abstraction for
complex problems in a specific domain, stencil computation. Using domain-specific techniques,
the underlying framework optimizes the code aggressively. We show that even in
a restricted domain, automatic tuning mechanisms and robust architectural abstraction are
necessary to improve performance. Using the abstraction layer, we demonstrate strong scaling
of various applications to multiple GPUs with a speedup of up to 1.9x on two GPUs
and 3.6x on four.
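The reported multi-GPU speedups can be read as parallel efficiency, that is, speedup divided by device count. The short calculation below assumes the speedups are measured against a single GPU, which the abstract does not state explicitly; `parallel_efficiency` is a name introduced here for illustration.

```python
def parallel_efficiency(speedup, devices):
    """Fraction of ideal linear scaling achieved: speedup / device count."""
    return speedup / devices


# Best-case speedups quoted in the abstract, keyed by number of GPUs.
reported = {2: 1.9, 4: 3.6}
efficiency = {n: parallel_efficiency(s, n) for n, s in reported.items()}

for n, e in efficiency.items():
    print(f"{n} GPUs: {e:.0%} of linear scaling")
```

Under that assumption, the figures correspond to roughly 95% efficiency on two GPUs and 90% on four.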