24 research outputs found
Application Partitioning and Mapping Techniques for Heterogeneous Parallel Platforms
Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016), Timisoara, Romania, February 8-11, 2016. Parallelism has become one of the most widespread paradigms used to improve performance. Legacy source code needs
to be re-written so that it can take advantage of multi-core and many-core computing devices, such as GPGPU,
FPGA, DSP or specific accelerators. However, this forces software developers to adapt applications and coding
mechanisms in order to exploit the available computing devices, a time-consuming and error-prone task that
usually results in expensive and sub-optimal parallel software.
In this work, we describe a parallel programming model, a set of annotating techniques and a static scheduling
algorithm for parallel applications. Their purpose is to simplify the task of transforming sequential legacy code
into parallel code capable of making full use of several different computing devices, with the objective of increasing
performance, lowering energy consumption, and increasing developer productivity.

European Cooperation in Science and Technology (COST). The work presented in this paper has been partially supported by the EU under the COST programme Action IC1305, "Network for Sustainable Ultrascale Computing (NESUS)". The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement n. 609666 and by the Spanish Ministry of Economics and Competitiveness under grant TIN2013-41350-P.
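The abstract above describes a static scheduler that maps annotated tasks onto heterogeneous devices. A minimal sketch of one such policy, assuming per-device cost annotations and a greedy earliest-finish-time rule (the device names, costs, and the greedy rule itself are illustrative, not the paper's algorithm):

```python
# Minimal static-scheduling sketch: tasks annotated with per-device costs
# are greedily assigned to whichever device would finish them earliest.
# Device names and cost numbers are invented for illustration.

def schedule(tasks, devices):
    """Greedy earliest-finish-time mapping of tasks onto devices.

    tasks:   list of (name, {device: cost}) pairs, in dependency order
    devices: list of device names
    Returns a dict mapping each task name to its chosen device.
    """
    ready_at = {d: 0.0 for d in devices}   # time each device becomes free
    mapping = {}
    for name, costs in tasks:
        # pick the device that finishes this task first (ties: first listed)
        best = min(devices, key=lambda d: ready_at[d] + costs[d])
        ready_at[best] += costs[best]
        mapping[name] = best
    return mapping

tasks = [
    ("fft",    {"cpu": 8.0, "gpu": 2.0}),
    ("filter", {"cpu": 3.0, "gpu": 1.0}),
    ("reduce", {"cpu": 1.0, "gpu": 4.0}),
]
print(schedule(tasks, ["cpu", "gpu"]))  # -> {'fft': 'gpu', 'filter': 'cpu', 'reduce': 'cpu'}
```

Note how "reduce" lands on the CPU even though its GPU cost is not prohibitive: the GPU is already busy, which is exactly the load-balancing effect a static scheduler exploits.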
Optimizing Streaming Parallelism on Heterogeneous Many-Core Architectures
As many-core accelerators keep integrating more processing units, it becomes increasingly more difficult for a parallel application to make effective use of all available resources. An effective way for improving hardware utilization is to exploit spatial and temporal sharing of the heterogeneous processing units by multiplexing computation and communication tasks - a strategy known as heterogeneous streaming. Achieving effective heterogeneous streaming requires carefully partitioning hardware among tasks, and matching the granularity of task parallelism to the resource partition. However, finding the right resource partitioning and task granularity is extremely challenging, because there is a large number of possible solutions and the optimal solution varies across programs and datasets. This article presents an automatic approach to quickly derive a good solution for hardware resource partition and task granularity for task-based parallel applications on heterogeneous many-core architectures. Our approach employs a performance model to estimate the resulting performance of the target application under a given resource partition and task granularity configuration. The model is used as a utility to quickly search for a good configuration at runtime. Instead of hand-crafting an analytical model that requires expert insights into low-level hardware details, we employ machine learning techniques to automatically learn it. We achieve this by first learning a predictive model offline using training programs. The learnt model can then be used to predict the performance of any unseen program at runtime. We apply our approach to 39 representative parallel applications and evaluate it on two representative heterogeneous many-core platforms: a CPU-XeonPhi platform and a CPU-GPU platform. Compared to the single-stream version, our approach achieves, on average, a 1.6x and 1.1x speedup on the XeonPhi and the GPU platform, respectively. 
These results translate to over 93% of the performance delivered by a theoretically perfect predictor.
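The runtime search described above can be sketched as follows: a learned performance model (stubbed here as a simple analytical function, not the paper's trained model) scores each candidate resource-partition/task-granularity configuration, and the runtime keeps the best one. All constants are illustrative assumptions:

```python
# Sketch of model-driven configuration search: score every candidate
# <resource partition, task granularity> pair with a (stand-in) performance
# model and pick the minimum. The toy model below is NOT the learned model
# from the article; it merely shows how such a model is consumed at runtime.

def predicted_time(cpu_cores, total_cores, granularity, work=1000.0):
    accel_cores = total_cores - cpu_cores
    # toy model: accelerator cores count double, plus per-task launch overhead
    compute = work / (cpu_cores + 2.0 * accel_cores)
    overhead = 0.05 * (work / granularity)
    return compute + overhead

def best_config(total_cores=16, granularities=(8, 32, 128)):
    candidates = [(c, g) for c in range(1, total_cores) for g in granularities]
    return min(candidates,
               key=lambda cg: predicted_time(cg[0], total_cores, cg[1]))

cores, gran = best_config()
print(cores, gran)
```

Because the model is cheap to evaluate, the exhaustive sweep over candidates finishes quickly at runtime, which is the property that makes the approach practical.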
Automatic and Portable Mapping of Data Parallel Programs to OpenCL for GPU-Based Heterogeneous Systems
General-purpose GPU-based systems are highly attractive, as they give potentially massive performance at little cost. Realizing such potential is challenging due to the complexity of programming. This article presents a compiler-based approach to automatically generate optimized OpenCL code from data parallel OpenMP programs for GPUs. A key feature of our scheme is that it leverages existing transformations, especially data transformations, to improve performance on GPU architectures and uses automatic machine learning to build a predictive model to determine if it is worthwhile running the OpenCL code on the GPU or OpenMP code on the multicore host. We applied our approach to the entire NAS parallel benchmark suite and evaluated it on distinct GPU-based systems. We achieved average (up to) speedups of 4.51× and 4.20× (143× and 67×) on Core i7/NVIDIA GeForce GTX580 and Core i7/AMD Radeon 7970 platforms, respectively, over a sequential baseline. Our approach achieves, on average, greater than 10× speedups over two state-of-the-art automatic GPU code generators.
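The GPU-or-host decision above is a binary prediction over program features. A hand-written stand-in for such a learned predictor, with feature names and thresholds that are purely illustrative:

```python
# Minimal sketch of a "worthwhile to offload?" predictor. A real system
# would learn this decision function from training programs; the rules and
# thresholds below are invented assumptions for illustration only.

def choose_device(features):
    """Return 'gpu' if offloading is predicted worthwhile, else 'cpu'."""
    # small problems rarely amortize the host-device transfer cost
    if features["iterations"] * features["bytes_per_iter"] < 1_000_000:
        return "cpu"
    # divergent, memory-bound kernels tend to run better on the multicore host
    if features["branch_ratio"] > 0.3 and features["compute_intensity"] < 1.0:
        return "cpu"
    return "gpu"

print(choose_device({"iterations": 10_000_000, "bytes_per_iter": 8,
                     "branch_ratio": 0.05, "compute_intensity": 4.0}))  # gpu
```

In practice the feature vector is extracted by the compiler from the data-parallel loop, so the decision costs almost nothing at runtime.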
Parallel programming systems for scalable scientific computing
High-performance computing (HPC) systems are more powerful than ever before. However, this rise in performance brings with it greater complexity, presenting significant challenges for researchers who wish to use these systems for their scientific work. This dissertation explores the development of scalable programming solutions for scientific computing. These solutions aim to be effective across a diverse range of computing platforms, from personal desktops to advanced supercomputers.

To better understand HPC systems, this dissertation begins with a literature review on exascale supercomputers, massive systems capable of performing 10¹⁸ floating-point operations per second. This review combines both manual and data-driven analyses, revealing that while traditional challenges of exascale computing have largely been addressed, issues like software complexity and data volume remain. Additionally, the dissertation introduces the open-source software tool (called LitStudy) developed for this research.

Next, this dissertation introduces two novel programming systems. The first system (called Rocket) is designed to scale all-versus-all algorithms to massive datasets. It features a multi-level software-based cache, a divide-and-conquer approach, hierarchical work-stealing, and asynchronous processing to maximize data reuse, exploit data locality, dynamically balance workloads, and optimize resource utilization. The second system (called Lightning) aims to scale existing single-GPU kernel functions across multiple GPUs, even on different nodes, with minimal code adjustments. Results across eight benchmarks on up to 32 GPUs show excellent scalability.

The dissertation concludes by proposing a set of design principles for developing parallel programming systems for scalable scientific computing. These principles, based on lessons from this PhD research, represent significant steps forward in enabling researchers to efficiently utilize HPC systems.
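Rocket's combination of divide-and-conquer with work stealing can be illustrated with a single-threaded sketch of the stealing discipline (the queue layout and task names are assumptions, not Rocket's implementation): each worker runs the newest task from its own queue and, when it runs dry, steals the oldest task from the busiest victim.

```python
# Single-threaded simulation of the work-stealing discipline: local pops
# take the newest task (LIFO, good locality), steals take the oldest task
# (FIFO, typically the largest remaining subproblem). Illustrative only.

from collections import deque

def run(worker_queues):
    done = []
    n = len(worker_queues)
    while any(worker_queues):
        for w in range(n):
            q = worker_queues[w]
            if not q:  # out of local work: steal from the busiest victim
                victim = max(range(n), key=lambda v: len(worker_queues[v]))
                if worker_queues[victim]:
                    q.append(worker_queues[victim].popleft())  # steal oldest
            if q:
                done.append(q.pop())  # run newest local task
    return done

queues = [deque(["a1", "a2", "a3", "a4"]), deque([])]
print(run(queues))
```

A real implementation uses concurrent deques and per-node hierarchy to keep steals cheap; the simulation only shows why an idle worker never starves while any queue holds work.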
Type-Directed Program Synthesis and Constraint Generation for Library Portability
Fast numerical libraries have been a cornerstone of scientific computing for
decades, but this comes at a price. Programs may be tied to vendor-specific
software ecosystems, resulting in polluted, non-portable code. As we enter an
era of heterogeneous computing, there is an explosion in the number of
accelerator libraries required to harness specialized hardware. We need a
system that allows developers to exploit ever-changing accelerator libraries,
without over-specializing their code.
As we cannot know the behavior of future libraries ahead of time, this paper
develops a scheme that assists developers in matching their code to new
libraries, without requiring the source code for these libraries.
Furthermore, it can recover equivalent code from programs that use existing
libraries and automatically port them to new interfaces. It first uses program
synthesis to determine the meaning of a library, then maps the synthesized
description into generalized constraints which are used to search the program
for replacement opportunities to present to the developer.
We applied this approach to existing large applications from the scientific
computing and deep learning domains. Using our approach, we show speedups
ranging from 1.1× to over 10× in end-to-end performance when
using accelerator libraries. Comment: Accepted to PACT 201
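The synthesis-then-match pipeline above can be caricatured in a few lines: candidate library functions are filtered by type signature, composed, and validated against input/output examples to recover an equivalent call. The function inventory and signatures below are invented for illustration, not the paper's system:

```python
# Toy type-directed synthesis: enumerate pipelines of library functions whose
# signatures compose from the argument type to the result type, then keep the
# first pipeline consistent with all input/output examples.
# The LIBRARY inventory is hypothetical.

import itertools

# hypothetical inventory: name -> (arg_type, ret_type, implementation)
LIBRARY = {
    "vec_scale": ("vec", "vec", lambda v: [2 * x for x in v]),
    "vec_sum":   ("vec", "num", lambda v: sum(v)),
    "vec_rev":   ("vec", "vec", lambda v: v[::-1]),
}

def synthesize(arg_type, ret_type, examples, depth=2):
    """Find a pipeline of library calls matching the types and the examples."""
    typed = [(n, a, r, f) for n, (a, r, f) in LIBRARY.items()]
    for k in range(1, depth + 1):
        for chain in itertools.product(typed, repeat=k):
            # the chain's types must compose from arg_type to ret_type
            if chain[0][1] != arg_type or chain[-1][2] != ret_type:
                continue
            if any(chain[i][2] != chain[i + 1][1] for i in range(k - 1)):
                continue
            def run(v, chain=chain):
                for _, _, _, f in chain:
                    v = f(v)
                return v
            if all(run(inp) == out for inp, out in examples):
                return [name for name, _, _, _ in chain]
    return None

print(synthesize("vec", "num", [([1, 2, 3], 12), ([5], 10)]))
```

The type filter is what keeps the search tractable: most of the exponential space of pipelines is rejected before any example is ever executed.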
Opportunistic acceleration of array-centric Python computation in heterogeneous environments
Dynamic scripting languages, like Python, are growing in popularity and increasingly used by non-expert programmers. These languages provide high level abstractions such as safe memory management, dynamic type handling and array bounds checking. The reduction in boilerplate code enables the concise expression of computation compared to statically typed and compiled languages. This improves programmer productivity. Increasingly, scripting languages are used by domain experts to write numerically intensive code in a variety of domains (e.g. Economics, Zoology, Archaeology and Physics). These programs are often used not just for prototyping but also in deployment. However, such managed program execution comes with a significant performance penalty arising from the interpreter having to decode and dispatch based on dynamic type checking.
Modern computer systems are increasingly equipped with accelerators such as GPUs. However, the massive speedups that can be achieved by GPU accelerators come at the cost of program complexity. Directly programming a GPU requires a deep understanding of the computational model of the underlying hardware architecture. While the complexity of such devices is abstracted by programming languages specialised for heterogeneous computing, such as CUDA and OpenCL, these are dialects of the low-level C systems programming language used primarily by expert programmers.
This thesis presents the design and implementation of ALPyNA, a loop parallelisation and GPU code generation framework. A novel staged parallelisation approach is used to aggressively parallelise each execution instance of a loop nest. Loop dependence relationships that cannot be inferred statically are deferred for runtime analysis. At runtime, these dependences are augmented with runtime information obtained by introspection and the loop nest is parallelised. Parallel GPU kernels are customised to the runtime dependence graph, JIT compiled and executed.
A systematic analysis of the execution speed of loop nests is performed using 12 standard loop intensive benchmarks. The evaluation is performed on two CPU-GPU machines. One is a server grade machine while the other is a typical desktop. ALPyNA's GPU kernels achieve orders of magnitude speedup over the baseline interpreter execution time (up to 16435x) and large speedups (up to 179.55x) over JIT compiled CPU code.
The varied performance of JIT compiled GPU code motivates the need for a sophisticated cost model to select the device providing the best speedups at runtime for varying domain sizes. This thesis describes a novel lightweight analytical cost model to determine the fastest device to execute a loop nest at runtime. The ALPyNA Cost Model (ACM) adapts to runtime dependence analysis and is parameterised on the hardware characteristics of the underlying target CPU or GPU. The cost model also takes into account the relative rate at which the interpreter is able to supply the GPU with computational work. ACM is re-targetable to other accelerator devices and only requires minimal install-time profiling.
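The shape of such an analytical cost model can be sketched as follows; the hardware parameters below are illustrative round numbers, not ACM's calibrated values: estimate CPU and GPU times for a loop nest, then dispatch to the cheaper device.

```python
# Sketch of an analytical CPU-vs-GPU dispatch model. The throughput,
# bandwidth, and overhead constants are invented placeholders; a real model
# is calibrated per machine by install-time profiling.

def cpu_time(iters, ops_per_iter, cpu_ops_per_sec=2e9):
    return iters * ops_per_iter / cpu_ops_per_sec

def gpu_time(iters, ops_per_iter, bytes_moved,
             gpu_ops_per_sec=200e9, pcie_bytes_per_sec=12e9,
             launch_overhead=5e-4):
    transfer = bytes_moved / pcie_bytes_per_sec   # host <-> device copies
    compute = iters * ops_per_iter / gpu_ops_per_sec
    return launch_overhead + transfer + compute

def pick_device(iters, ops_per_iter, bytes_moved):
    return ("gpu" if gpu_time(iters, ops_per_iter, bytes_moved)
            < cpu_time(iters, ops_per_iter) else "cpu")

# a large loop nest amortizes transfer and launch costs ...
print(pick_device(iters=10**8, ops_per_iter=20, bytes_moved=8 * 10**8))
# ... while a tiny one does not
print(pick_device(iters=10**3, ops_per_iter=20, bytes_moved=8 * 10**3))
```

The fixed launch overhead and transfer term are what make small domains CPU-bound: below some domain size, the GPU never wins regardless of its compute throughput.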
Mapping parallel programs to heterogeneous multi-core systems
Heterogeneous computer systems are ubiquitous in all areas of computing, from mobile
to high-performance computing. They promise to deliver increased performance
at lower energy cost than purely homogeneous, CPU-based systems. In recent years
GPU-based heterogeneous systems have become increasingly popular. They combine
a programmable GPU with a multi-core CPU. GPUs have become flexible enough
to not only handle graphics workloads but also various kinds of general-purpose
algorithms. They are thus used as a coprocessor or accelerator alongside the CPU.
Developing applications for GPU-based heterogeneous systems involves several
challenges. Firstly, not all algorithms are equally suited for GPU computing. It is thus
important to carefully map the tasks of an application to the most suitable processor
in a system. Secondly, current frameworks for heterogeneous computing, such as
OpenCL, are low-level, requiring a thorough understanding of the hardware by the
programmer. This high barrier to entry could be lowered by automatically generating
and tuning this code from a high-level and thus more user-friendly programming
language. Both challenges are addressed in this thesis.
For the task mapping problem a machine learning-based approach is presented in
this thesis. It combines static features of the program code with runtime information
on input sizes to predict the optimal mapping of OpenCL kernels. This approach is
further extended to also take contention on the GPU into account. Both methods are
able to outperform competing mapping approaches by a significant margin.
Furthermore, this thesis develops a method for targeting GPU-based heterogeneous
systems from OpenMP, a directive-based framework for parallel computing.
OpenMP programs are translated to OpenCL and optimized for GPU performance.
At runtime a predictive model decides whether to execute the original OpenMP code
on the CPU or the generated OpenCL code on the GPU. This approach is shown to
outperform both a competing approach and hand-tuned code.
Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016)
Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016), Timisoara, Romania, February 8-11, 2016. The PhD Symposium was a very good opportunity for young researchers to share information and knowledge, to present their current research, and to discuss topics with other students in order to look for synergies and common research topics. The idea was very successful and the assessment made by the PhD students was very good. It also helped to achieve one of the major goals of the NESUS Action: to establish an open European research network targeting sustainable solutions for ultrascale computing, aiming at cross-fertilization among HPC, large-scale distributed systems, and big data management and training; gluing together disparate researchers working across different areas; and providing a meeting ground for researchers in these separate areas to exchange ideas, identify synergies, and pursue common activities in research topics such as sustainable software solutions (applications and the system software stack), data management, energy efficiency, and resilience.

European Cooperation in Science and Technology. COST