Efficient execution of Java programs on GPU
Master's dissertation in Informatics Engineering.

With the overwhelming increase in the demand for computational power from fields such as Big Data, deep learning and image processing, Graphics Processing Units (GPUs)
have come to be seen as a valuable tool for computing the main workload involved. Nonetheless,
these solutions have limited support for object-oriented languages and often require manual
memory handling, which is an obstacle to bringing together the large community of object-oriented programmers and the high-performance computing field.
In this master's thesis, different memory optimizations and their impacts were studied
in a GPU Java context using Aparapi. These include solutions for identifiable
bottlenecks in commonly used kernels, exploiting the GPU's full capabilities through a study of its
hardware and of the techniques currently available. The results were set against commonly used
C/OpenCL benchmarks and their respective optimizations, proving that high-level languages can
be a solution to the demand for high-performance software.
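One memory-related obstacle behind this abstract is that GPU frameworks such as Aparapi operate on primitive arrays rather than on object graphs. The following is a minimal sketch of the required flattening step, in Python for illustration (the thesis itself targets Java, and the `Particle` class and `flatten` helper are invented here):

```python
# Illustrative sketch, not the thesis's actual code: object-oriented data
# must typically be flattened into primitive buffers before a GPU kernel
# (e.g. an Aparapi Kernel) can process it without per-object indirection.

class Particle:
    def __init__(self, x, y):
        self.x = x
        self.y = y

def flatten(particles):
    """Array-of-objects -> structure-of-arrays (contiguous primitive buffers)."""
    xs = [p.x for p in particles]
    ys = [p.y for p in particles]
    return xs, ys

particles = [Particle(float(i), float(i * 2)) for i in range(4)]
xs, ys = flatten(particles)
# xs and ys are now flat buffers a GPU kernel can read directly.
print(xs)  # [0.0, 1.0, 2.0, 3.0]
```

This structure-of-arrays layout also gives coalesced memory accesses on the GPU, which is one of the classes of memory optimization such work typically studies.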
Machine Learning Based Auto-tuning for Enhanced OpenCL Performance Portability
Heterogeneous computing, which combines devices with different architectures,
is rising in popularity, and promises increased performance combined with
reduced energy consumption. OpenCL has been proposed as a standard for
programming such systems, and offers functional portability. It does, however,
suffer from poor performance portability: code tuned for one device must be
re-tuned to achieve good performance on another device. In this paper, we use
machine learning-based auto-tuning to address this problem. Benchmarks are run
on a random subset of the entire tuning parameter configuration space, and the
results are used to build an artificial neural network based model. The model
can then be used to find interesting parts of the parameter space for further
search. We evaluate our method with different benchmarks, on several devices,
including an Intel i7 3770 CPU, an Nvidia K40 GPU and an AMD Radeon HD 7970
GPU. Our model achieves a mean relative error as low as 6.1%, and is able to
find configurations as little as 1.3% worse than the global minimum.

Comment: This is a pre-print version of an article to be published in the
Proceedings of the 2015 IEEE International Parallel and Distributed
Processing Symposium Workshops (IPDPSW). For personal use only.
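The sample-then-model loop described above can be sketched in a few lines. Everything here is invented for illustration: the benchmark cost function, the parameter space, and a nearest-neighbour surrogate standing in for the paper's artificial neural network:

```python
import itertools, random

def run_benchmark(cfg):
    # Stand-in for a real OpenCL benchmark run; the true cost surface is
    # unknown, so this synthetic function is purely illustrative.
    wg, unroll = cfg
    return abs(wg - 64) + 4 * abs(unroll - 4)

# 1. Enumerate the tuning-parameter configuration space.
space = list(itertools.product([16, 32, 64, 128, 256], [1, 2, 4, 8]))

# 2. Benchmark only a random subset of it.
random.seed(0)
sample = random.sample(space, 8)
observed = {cfg: run_benchmark(cfg) for cfg in sample}

# 3. Fit a cheap surrogate model (1-nearest-neighbour here, in place of
#    the paper's neural network) on the observed results.
def predict(cfg):
    nearest = min(observed, key=lambda c: (c[0] - cfg[0]) ** 2 + (c[1] - cfg[1]) ** 2)
    return observed[nearest]

# 4. Use the model to pick promising unmeasured configurations for
#    further (real) search.
candidates = sorted((c for c in space if c not in observed), key=predict)[:3]
print(candidates)
```

The key property the paper exploits is that step 4 costs only model evaluations, so the expensive benchmark in step 2 is run on a small fraction of the space.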
Analysis and parameter prediction of compiler transformation for graphics processors
In the last decade, graphics processors (GPUs) have been extensively used to solve computationally
intensive problems. A variety of GPU architectures by different hardware manufacturers
have been shipped within the span of a few years. OpenCL has been introduced as the standard cross-vendor
programming framework for GPU computing. Writing and optimising OpenCL applications is
a challenging task: the programmer has to take care of several low-level details. This is even
harder when the goal is to improve performance on a wide range of devices, since OpenCL does not
guarantee performance portability.
In this thesis we focus on the analysis and the portability of compiler optimisations. We
describe the implementation of a portable compiler transformation: thread-coarsening. The
transformation increases the amount of work carried out by a single thread running on the GPU.
The goal is to reduce the amount of redundant instructions executed by the parallel application.
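The effect of coarsening can be modelled in a few lines. This is an illustrative sketch, not the thesis's compiler pass: the kernel, the redundant per-work-item computation, and the operation counter are all invented:

```python
# Thread coarsening, modelled in plain Python: merging `factor` adjacent
# work-items into one thread lets computation that every work-item would
# repeat be executed once per thread instead.

def kernel_original(data, scale_table):
    ops = 0
    out = []
    for gid in range(len(data)):          # one work-item per element
        scale = sum(scale_table)          # redundant: identical in every work-item
        ops += len(scale_table)
        out.append(data[gid] * scale)
    return out, ops

def kernel_coarsened(data, scale_table, factor):
    ops = 0
    out = [0] * len(data)
    for tid in range(0, len(data), factor):
        scale = sum(scale_table)          # now computed once per `factor` items
        ops += len(scale_table)
        for gid in range(tid, min(tid + factor, len(data))):
            out[gid] = data[gid] * scale
    return out, ops

data, table = list(range(8)), [1, 2, 3]
base, base_ops = kernel_original(data, table)
coarse, coarse_ops = kernel_coarsened(data, table, factor=4)
assert base == coarse              # same result
print(base_ops, coarse_ops)        # prints 24 6: redundant work eliminated
```

The trade-off the thesis analyses is that coarsening also reduces the number of parallel threads, so the best factor depends on the program and the device.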
The first contribution is a technique to analyse the performance improvements and degradations
given by the compiler transformation: we study the changes in hardware performance
counters when applying coarsening. In this way we identify the root causes of the execution-time
variations due to coarsening.
As a second contribution, we study the relative performance of coarsening over multiple
input sizes. We show that the speedups given by coarsening are stable for problem sizes larger
than a threshold that we call the saturation point. We exploit the existence of the saturation point
to speed up iterative compilation.
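How the saturation point shortens iterative compilation can be sketched as follows; the speedup curve here is synthetic and purely illustrative, as is the assumed saturation size:

```python
# Sketch of exploiting the saturation point: above some problem size the
# speedup of each coarsening factor stabilises, so iterative search can be
# run once at a saturated size and its winner reused for larger inputs.

def speedup(factor, size, saturation=1024):
    # Synthetic speedup curve (illustrative only): varies below the
    # saturation point, then flattens out.
    if size < saturation:
        return 1.0 + factor / size
    return 1.0 + factor / saturation

factors = [1, 2, 4, 8]

# Tune once at a size past the saturation point...
best = max(factors, key=lambda f: speedup(f, 2048))

# ...and reuse that factor for every larger problem size, skipping a
# full iterative search per size.
for size in (4096, 8192):
    assert best == max(factors, key=lambda f: speedup(f, size))
print(best)
```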
The last contribution of the work is the development of a machine learning technique that
automatically selects a coarsening configuration that improves performance. The technique is
based on an iterative model built using a neural network. The network is trained once for a
GPU model and used for several programs. To prove the flexibility of our techniques, all our
experiments have been deployed on multiple GPU models by different vendors.
Machine Learning in Compiler Optimization
In the last decade, machine learning-based compilation has moved from an obscure research niche to a mainstream activity. In this article, we describe the relationship between machine learning and compiler optimisation and introduce the main concepts of features, models, training and deployment. We then provide a comprehensive survey and a road map for the wide variety of different research areas. We conclude with a discussion of open issues in the area and of potential research directions. This paper provides both an accessible introduction to the fast-moving area of machine learning-based compilation and a detailed bibliography of its main achievements.
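The four concepts the survey names (features, models, training, deployment) fit into one tiny loop. Everything below is invented for illustration: the toy "programs", the feature extractor, and a 1-nearest-neighbour model in place of the learned models the survey covers:

```python
# Features: summarise a program as a numeric vector.
def features(program):
    # Counts of arithmetic ops, loads and branches (toy feature set).
    return (program.count("+"), program.count("load"), program.count("if"))

# Training: (program, best optimisation) pairs, hand-made here.
training = [
    ("load load + + + +", "vectorise"),
    ("if if if load",     "no-op"),
    ("+ + + + + +",       "unroll"),
]

# Model: a 1-nearest-neighbour lookup over training feature vectors.
model = [(features(p), opt) for p, opt in training]

# Deployment: predict the best optimisation for an unseen program.
def deploy(program):
    f = features(program)
    dist = lambda g: sum((a - b) ** 2 for a, b in zip(f, g))
    return min(model, key=lambda m: dist(m[0]))[1]

print(deploy("+ + + + +"))  # arithmetic-heavy, so it prints "unroll"
```

Real systems differ mainly in scale: richer features (static code metrics, performance counters), learned models rather than lookup, and training sets gathered by benchmarking.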
A CUDA backend for Marrow and its Optimisation via Machine Learning
Nowadays, various industries, from business to science, deal with collecting, processing
and storing massive amounts of data. Conventional CPUs, which are optimised
for sequential performance, struggle to keep up with processing so much data; GPUs,
however, designed for parallel computation, are more than up to the task.
Using GPUs for general processing has become more popular in recent years due
to the need for fast parallel processing, but developing programs that execute on the
GPU can be difficult and time-consuming. Various high-level APIs that compile into
GPU programs exist, but due to the abstraction of lower-level concepts and the lack
of algorithm-specific optimisations, it may not be possible to reach peak performance.
Optimisation in particular is an interesting problem: optimisation patterns can very rarely
be applied uniformly to different algorithms, and manually tuning individual programs
is extremely time-consuming.
Machine learning compilation is a concept that has gained attention in recent
years, with good reason. The idea is to train a model with a machine learning
algorithm and have it estimate how to optimise an input program. Predicting
the best optimisations for a program is much faster than doing so manually, and works
using this technique have shown that it can also yield even better optimisations.
In this thesis, we work with the Marrow framework and develop a CUDA-based
backend for it, so that low-level GPU code may be generated. Additionally, we
train a machine learning model and use it to automatically optimise the CUDA
code generated from Marrow programs.
- …