52 research outputs found
Análise de coerência de dados e otimização
Orientadores: Guido Costa Souza de Araújo, Marcio Machado PereiraDissertação (mestrado) - Universidade Estadual de Campinas, Instituto de ComputaçãoResumo: Embora a computação heterogenea tenha permitido ganhos de desempenho (speed-ups) impressionantes, o conhecimento sobre a arquitetura dos dispositivos aceleradores para colher todos os benefícios de seu hardware ainda é algo crítico. A programação em cima dessas arquiteturas é complexa, propensa a erros e geralmente é feita por meio de lin- guagens especializadas (por exemplo, CUDA) ou bibliotecas (por exemplo, OpenCL). Em particular, para os programadores não especialistas, o custo de mover e manter dados co- erentes entre host e o dispositivo acelerador (device) pode facilmente eliminar quaisquer ganhos de desempenho alcançados pela aceleração. Esta dissertação propõe Análise de Coerência de Dados (DCA), uma simples e útil técnica de análise de fluxo de dados que determina como as variáveis são usadas pelo host/device em cada ponto do programa. Ela também introduz a Otimização de Coerência de Dados (DCO), um algoritmo baseado em DCA que: (a) usa informações das variáveis para alocar buffers OpenCL compartilhados entre o host e o device; e (b) inserir chamadas de função OpenCL apropriadas em pontos do programa de modo a minimizar o número de operações de coerência de dados. O DCO foi implementado no compilador GPUClang LLVM que é capaz de traduzir automatica- mente os loops anotados do OpenMP 4.X para kernels OpenCL, escondendo assim toda a complexidade da programação direta no OpenCL. Os resultados experimentais revelam que, enquanto GPUClang mostra desempenho de até 78x, GPUClang com DCO consegue speed-ups de até 84x em programas do benchmark Polybench rodando em um Exynos 8890 Octacore CPU com ARM Mali-T880 MP12 GPU e até 92x em um Processador dual core Intel Core i5 de 2,4 GHz equipado com uma unidade Intel Iris GPUAbstract: Although heterogeneous computing has enabled some impressive program speed-ups, knowledge about the architecture of the target device is still critical to reap the full benefits of its hardware. Programming such architectures is complex, error-prone and is usually done by means of specialized languages (e.g. CUDA) or complex function libraries (e.g. OpenCL). In particular, for non-expert programmers the cost of moving and keeping host/device data coherent can easily eliminate any performance gains achieved by accel- eration. This dissertation proposes Data Coherence Analysis (DCA) a simple and yet useful data-flow analysis technique that determines how variables are used by host/device at each program point. It also introduces Data Coherence Optimization (DCO), a DCA- based algorithm that: (a) uses variable information to allocate OpenCL shared buffers between host and devices; and (b) inserts appropriate OpenCL function calls into program points so as to minimize the number of required data coherence operations. DCO was implemented in the GPUClang LLVM compiler which is capable of automatically trans- lating OpenMP 4.X annotated loops to OpenCL kernels, thus hiding all the complexity of directly programming in OpenCL. Experimental results reveal that while GPUClang shows performance of up to 78x, GPUCLang with DCO can achieve speed-ups of up to 84x on programs from the Polybench benchmark running on an Exynos 8890 Octacore CPU with ARM Mali-T880 MP12 GPU and up to 92x on a 2.4 GHz dual-core Intel Core i5 processor equipped with an Intel Iris GPU unitMestradoCiência da ComputaçãoMestre em Ciência da Computação4719.8Funcam
Translating OpenMP Device Constructs for NVIDIA GPUs
학위논문 (석사) -- 서울대학교 대학원 : 공과대학 컴퓨터공학부, 2020. 8. 이재진.본 논문은 OpenMP 4.5 device construct를 사용해 작성된 C 프로그램을 CUDA 프로그램으로 변환하는 컴파일러와 런타임 시스템으로 구성된 프레임워크를 다룬다. 이를 위해 OpenMP C 프로그램을 CUDA 프로그램으로 변환하는 컴파일러 및 런타임 라이브러리를 제시한다. 먼저, OpenMP와 CUDA의 실행 모델, 메모리 모델 및 동기화 과정을 조망하고, source-level 컴파일러의 디자인과 정확성을 확보하기 위한 방법을 설명한다. 또한, 성능 향상을 위한 동적 구간 트리, 동적 커널 선택, 버디 할당자, UDTE와 같은 런타임 시스템 최적화 기술을 도입한다.
spec-accel 1.2 벤치마크를 이용한 실험 결과 gcc7 대비 6배 이상, mriq를 제외한 경우 2배 이상의 성능을 보여준다. 향후 본 논문의 프레임워크를 바탕으로 추가적인 런타임 최적화 기능 및 컴파일러 최적화 기술을 적용할 수 있을 것으로 기대된다.This paper deals with a framework composed of a compiler and runtime system that converts C programs written using the OpenMP 4.5 device construct into CUDA programs. To this end, we present a compiler and runtime library that converts OpenMP C programs to CUDA programs. First, we look at the execution model, memory model, and synchronization process of OpenMP and CUDA, and explain how to secure the design and accuracy of the source-level compiler. In addition, runtime system optimization techniques such as dynamic section tree, dynamic kernel selection, buddy allocator, and UDTE are introduced to improve performance.
Using the spec-accel 1.2 benchmark, it showed more than 6 times the performance of gcc7, and more than 2 times when excluding mriq. It is expected that additional runtime optimization and compiler optimization techniques can be applied based on the framework of this paper.제 1 장 서론 1
제 2 장 배경 및 관련 연구 5
제 3 장 CUDA 프로그래밍 모델 7
3.1 실행 모델 7
3.2 메모리 모델 및 동기화 9
제 4 장 OpenMP와 Device Constructs 모델 10
4.1 실행 모델 10
4.2 메모리 모델 및 동기화 14
제 5 장 Source-level 변환 컴파일러 15
5.1 루프 병렬화 15
5.2 메모리 관리 18
5.3 동기화(Synchronization) 19
제 6 장 런타임 시스템 최적화 20
6.1 커널 실행 최적화 20
6.1.1 동적 구간 트리(Dynamic Interval Tree) 20
6.1.2 동적 커널 선택(Dynamic Kernel Selection) 23
6.2 메모리 관리 최적화 26
6.2.1 버디 할당자 26
6.2.2 UDTE 27
제 7 장 실험 결과 및 분석 29
7.1 실험 환경 29
7.2 실험 결과 및 분석 31
7.2.1 gcc7과의 비교 31
7.2.2 PGI OpenACC과의 비교 34
7.2.3 논의 38
제 8 장 결론 39
참고문헌 40
Abstract 42Maste
Multilayered abstractions for partial differential equations
How do we build maintainable, robust, and performance-portable scientific
applications? This thesis argues that the answer to this software engineering
question in the context of the finite element method is through the use of
layers of Domain-Specific Languages (DSLs) to separate the various concerns in
the engineering of such codes.
Performance-portable software achieves high performance on multiple diverse
hardware platforms without source code changes. We demonstrate that finite
element solvers written in a low-level language are not performance-portable,
and therefore code must be specialised to the target architecture by a code
generation framework. A prototype compiler for finite element variational forms
that generates CUDA code is presented, and is used to explore how good
performance on many-core platforms in automatically-generated finite element
applications can be achieved. The differing code generation requirements for
multi- and many-core platforms motivates the design of an additional
abstraction, called PyOP2, that enables unstructured mesh applications to be
performance-portable.
We present a runtime code generation framework comprised of the Unified Form
Language (UFL), the FEniCS Form Compiler, and PyOP2. This toolchain separates
the succinct expression of a numerical method from the selection and
generation of efficient code for local assembly. This is further decoupled from
the selection of data formats and algorithms for efficient parallel
implementation on a specific target architecture.
We establish the successful separation of these concerns by demonstrating the
performance-portability of code generated from a single high-level source code
written in UFL across sequential C, CUDA, MPI and OpenMP targets. The
performance of the generated code exceeds the performance of comparable
alternative toolchains on multi-core architectures.Open Acces
Automatic matching of legacy code to heterogeneous APIs: An idiomatic approach
Heterogeneous accelerators often disappoint. They provide
the prospect of great performance, but only deliver it when
using vendor specific optimized libraries or domain specific
languages. This requires considerable legacy code modifications,
hindering the adoption of heterogeneous computing.
This paper develops a novel approach to automatically
detect opportunities for accelerator exploitation. We focus
on calculations that are well supported by established APIs:
sparse and dense linear algebra, stencil codes and generalized
reductions and histograms. We call them idioms and use a
custom constraint-based Idiom Description Language (IDL)
to discover them within user code. Detected idioms are then
mapped to BLAS libraries, cuSPARSE and clSPARSE and two
DSLs: Halide and Lift.
We implemented the approach in LLVM and evaluated
it on the NAS and Parboil sequential C/C++ benchmarks,
where we detect 60 idiom instances. In those cases where
idioms are a significant part of the sequential execution time,
we generate code that achieves 1.26× to over 20× speedup
on integrated and external GPUs
Tools for improving performance portability in heterogeneous environments
Programa Oficial de Doutoramento en Investigación en Tecnoloxías da Información. 524V01[Abstract]
Parallel computing is currently partially dominated by the availability of heterogeneous
devices. These devices differ from each other in aspects such as the
instruction set they execute, the number and the type of computing devices that
they offer or the structure of their memory systems. In the last years, langnages,
libraries and extensions have appeared to allow to write a parallel code once aud
run it in a wide variety of devices, OpenCL being the most widespread solution of
this kind. However, functional portability does not imply performance portability.
This way, one of the probletns that is still open in this field is to achieve automatic
performance portability. That is, the ability to automatically tune a given code for
any device where it will be execnted so that it ill obtain a good performance. This
thesis develops three different solutions to tackle this problem. The three of them
are based on typical source-to-sonrce optimizations for heterogeneous devices. Both
the set of optimizations to apply and the way they are applied depend on different
optimization parameters, whose values have to be tuned for each specific device.
The first solution is OCLoptimizer, a source-to-source optimizer that can optimize
annotated OpenCL kemels with the help of configuration files that guide the
optimization process. The tool optimizes kernels for a specific device, and it is also
able to automate the generation of functional host codes when only a single kernel
is optimized.
The two remaining solutions are built on top of the Heterogeneous Programming
Library (HPL), a C++ framework that provides an easy and portable way to exploit
heterogeneous computing systexns. The first of these solutions uses the run-time
code generation capabilities of HPL to generate a self-optimizing version of a matrix
multiplication that can optimize itself at run-time for an spedfic device. The last solutíon is the development of a built-in just-in-time optirnizer for HPL, that can
optirnize, at run-tirne, a HPL code for an specific device. While the first two solutions
use search processes to find the best values for the optimization parameters, this Iast
alternative relies on heuristics bMed on general optirnization strategies.[Resumen]
Actualmente la computación paralela se encuentra dominada parcialmente por
los múltiples dispositivos heterogéneos disponibles. Estos dispositivos difieren entre
sí en características tales como el conjunto de instrucciones que ejecutan, el número
y tipo de unidades de computación que incluyen o la estructura de sus sistemas de
memoria. Durante los últimos años han aparecido lenguajes, librerías y extensiones
que permiten escribir una única vez la versión paralela de un código y ejecutarla en
un amplio abanico de dispositivos, siendo de entre todos ellos OpenCL la solución
más extendida. Sin embargo, la portabilidad funcional no implica portabilidad de
rendimiento. Así, uno de los grandes problemas que sigue abierto en este campo
es la automatización de la portabilidad de rendimiento, es decir, la capacidad de
adaptar automáticamente un código dado para su ejecución en cualquier dispositivo
y obtener un buen rendimiento. Esta tesis aborda este problema planteando tres
soluciones diferentes al mismo. Las tres se basan en la aplicación de optimizaciones
de código a código usadas habitualmente en dispositivos heterogéneos. Tanto el
conjunto de optimizaciones a aplicar como la forma de aplicarlas dependen de varios
parámetros de optimización, cuyos valores han de ser ajustados para cada dispositivo
concreto.
La primera solución planteada es OCLoptirnizer, un optimizador de código a
código que a partir de kernels OpenCL anotados y ficheros de configuración como
apoyo, obtiene versiones optimizada de dichos kernels para un dispositivo concreto.
Además, cuando el kernel a optimizar es único, automatiza la generación de un
código de host funcional para ese kernel.
Las otras dos soluciones han sido implementadas utilizando Heterogeneous Prograrnming
LibranJ (HPL), una librería C++ que permite programar sistemas heterogéneos de forma fácil y portable. La primera de estas soluciones explota las
capacidades de generación de código en tiempo de ejecución de HPL para generar
versiones de un producto de matrices que se adaptan automáticamente en tiempo
de ejecución a las características de un dispositivo concreto. La última solución consiste
en el desarrollo e incorporación a HPL de un optimizador al vuelo, de fonna
que se puedan obtener en tiempo de ejecución versiones optimizadas de un código
HPL para un dispositivo dado. Mientras las dos primeras soluciones usan procesos
de búsqueda para encontrar los mejores valores para los parámetros de optimización,
esta última altemativa se basa para ello en heurísticas definidas a partir de
recomendaciones generales de optimización.[Resumo]
Actualmente a computación paralela atópase dominada parcialmente polos múltiples
dispositivos heteroxéneos dispoñibles. Estes dispositivos difiren entre si en características
tales como o conxunto de instruccións que executan, o número e tipo
de unidades de computación que inclúen ou a estrutura dos seus sistemas de mem~
ría. Nos últimos anos apareceron linguaxes, bibliotecas e extensións que permiten
escribir unha soa vez a versión paralela dun código e executala nun amplio abano de
dispositivos, senda de entre todos eles OpenCL a solución máis extendida. Porén, a
portabilidade funcional non implica portabilidade de rendemento. Deste xeito, uns
dos grandes problemas que segue aberto neste campo é a automatización da portabilidade
de rendemento, isto é, a capacidade de adaptar automaticamente un código
dado para a súa execución en calquera dispositivo e obter un bo rendemento. Esta
tese aborda este problema propondo tres solucións diferentes. As tres están baseadas
na aplicación de optimizacións de código a código usadas habitualmente en disp~
sitivos heteroxéneos. Tanto o conxunto de optimizacións a aplicar como a forma de
aplicalas dependen de varios parámetros de optimización para os que é preciso fixar
determinados valores en función do dispositivo concreto.
A primeira solución pro posta é OCLoptirnizer, un optimizador de código a código
que partindo de kemels OpenCL anotados e ficheiros de configuración de apoio,
obtén versións optimizadas dos devanditos kernels para un dispositivo concreto.
Amais, cando o kernel a optimizaré único, tarnén automatiza a xeración dun código
de host funcional para ese kernel.
As outras dúas solucións foron implementadas utilizando Heterogeneous Programming
Library (HPL), unha biblioteca C++ que permite programar sistemas
heteroxéneos de xeito fácil e portable. A primeira destas solucións explota as capacidades de xeración de código en tempo de execución de HPL para xerar versións
dun produto de matrices que se adaptan automaticamente ás características dun
dispositivo concreto. A última solución consiste no deseuvolvemento e incorporación
a HPL dun optimizador capaz de obter en tiempo de execución versións optimizada<;
dun código HPL para un dispositivo dado. Mentres as dúas primeiras solucións usan
procesos de procura para atopar os mellares valores para os parámetros de optimización,
esta última alternativa baséase para iso en heurísticas definidas a partir de
recomendacións xerais de optimización
Heterogeneous parallel virtual machine: A portable program representation and compiler for performance and energy optimizations on heterogeneous parallel systems
Programming heterogeneous parallel systems, such as the SoCs (System-on-Chip) on mobile and edge devices is extremely difficult; the diverse parallel hardware they contain exposes vastly different hardware instruction sets, parallelism models and memory systems. Moreover, a wide range of diverse hardware and software approximation techniques are available for applications targeting heterogeneous SoCs, further exacerbating the programmability challenges. In this thesis, we alleviate the programmability challenges of such systems using flexible compiler intermediate representation solutions, in order to benefit from the performance and superior energy efficiency of heterogeneous systems.
First, we develop Heterogeneous Parallel Virtual Machine (HPVM), a parallel program representation for heterogeneous systems, designed to enable functional and performance portability across popular parallel hardware. HPVM is based on a hierarchical dataflow graph with side effects. HPVM successfully supports three important capabilities for programming heterogeneous systems: a compiler intermediate representation (IR), a virtual instruction set (ISA), and a basis for runtime scheduling. We use the HPVM representation to implement an HPVM prototype, defining the HPVM IR as an extension of the Low Level Virtual Machine (LLVM) IR. Our results show comparable performance with optimized OpenCL kernels for the target hardware from a single HPVM representation using translators from HPVM virtual ISA to native code, IR optimizations operating directly on the HPVM representation, and the capability for supporting flexible runtime scheduling schemes from a single HPVM representation.
We extend HPVM to ApproxHPVM, introducing hardware-independent approximation metrics in the IR to enable maintaining accuracy information at the IR level and mapping of application-level end-to-end quality metrics to system level "knobs". The approximation metrics quantify the acceptable accuracy loss for individual computations. Application programmers only need to specify high-level, and end-to-end, quality metrics, instead of detailed parameters for individual approximation methods. The ApproxHPVM system then automatically tunes the accuracy requirements of individual computations and maps them to approximate hardware when possible. ApproxHPVM results show significant performance and energy improvements for popular deep learning benchmarks.
Finally, we extend to ApproxHPVM to ApproxTuner, a compiler and runtime system for approximation. ApproxTuner extends ApproxHPVM with a wide range of hardware and software approximation techniques. It uses a three step approximation tuning strategy, a combination of development-time, install-time, and dynamic tuning. Our strategy ensures software portability, even though approximations have highly hardware-dependent performance, and enables efficient dynamic approximation tuning despite the expensive offline steps. ApproxTuner results show significant performance and energy improvements across 7 Deep Neural Networks and 3 image processing benchmarks, and ensures that high-level end-to-end quality specifications are satisfied during adaptive approximation tuning
Automatic performance optimisation of parallel programs for GPUs via rewrite rules
Graphics Processing Units (GPUs) are now commonplace in computing systems and are the
most successful parallel accelerators. Their performance is orders of magnitude higher than
traditional Central Processing Units (CPUs) making them attractive for many application domains
with high computational demands. However, achieving their full performance potential
is extremely hard, even for experienced programmers, as it requires specialised software tailored
for specific devices written in low-level languages such as OpenCL. Differences in device
characteristics between manufacturers and even hardware generations often lead to large performance
variations when different optimisations are applied. This inevitably leads to code that
is not performance portable across different hardware.
This thesis demonstrates that achieving performance portability is possible using LIFT, a
functional data-parallel language which allows programs to be expressed at a high-level in a
hardware-agnostic way. The LIFT compiler is empowered to automatically explore the optimisation
space using a set of well-defined rewrite rules to transform programs seamlessly between
different high-level algorithmic forms before translating them to a low-level OpenCL-specific
form.
The first contribution of this thesis is the development of techniques to compile functional
LIFT programs that have optimisations explicitly encoded into efficient imperative OpenCL
code. Producing efficient code is non-trivial as many performance sensitive details such as
memory allocation, array accesses or synchronisation are not explicitly represented in the functional
LIFT language. The thesis shows that the newly developed techniques are essential for
achieving performance on par with manually optimised code for GPU programs with the exact
same complex optimisations applied.
The second contribution of this thesis is the presentation of techniques that enable the
LIFT compiler to perform complex optimisations that usually require from tens to hundreds of
individual rule applications by grouping them as macro-rules that cut through the optimisation
space. Using matrix multiplication as an example, starting from a single high-level program
the compiler automatically generates highly optimised and specialised implementations for
desktop and mobile GPUs with very different architectures achieving performance portability.
The final contribution of this thesis is the demonstration of how low-level and GPU-specific
features are extracted directly from the high-level functional LIFT program, enabling building
a statistical performance model that makes accurate predictions about the performance of differently
optimised program variants. This performance model is then used to drastically speed
up the time taken by the optimisation space exploration by ranking the different variants based
on their predicted performance.
Overall, this thesis demonstrates that performance portability is achievable using LIFT
- …