850 research outputs found
GHOST: Building blocks for high performance sparse linear algebra on heterogeneous systems
While many of the architectural details of future exascale-class high
performance computer systems are still a matter of intense research, there
appears to be a general consensus that they will be strongly heterogeneous,
featuring "standard" as well as "accelerated" resources. Today, such resources
are available as multicore processors, graphics processing units (GPUs), and
other accelerators such as the Intel Xeon Phi. Any software infrastructure that
claims usefulness for such environments must be able to meet their inherent
challenges: massive multi-level parallelism, topology, asynchronicity, and
abstraction. The "General, Hybrid, and Optimized Sparse Toolkit" (GHOST) is a
collection of building blocks that targets algorithms dealing with sparse
matrix representations on current and future large-scale systems. It implements
the "MPI+X" paradigm, has a pure C interface, and provides hybrid-parallel
numerical kernels, intelligent resource management, and truly heterogeneous
parallelism for multicore CPUs, Nvidia GPUs, and the Intel Xeon Phi. We
describe the details of its design with respect to the challenges posed by
modern heterogeneous supercomputers and recent algorithmic developments.
Implementation details which are indispensable for achieving high efficiency
are pointed out and their necessity is justified by performance measurements or
predictions based on performance models. The library code and several
applications are available as open source. We also provide instructions on how
to make use of GHOST in existing software packages, together with a case study
which demonstrates the applicability and performance of GHOST as a component
within a larger software stack.
Comment: 32 pages, 11 figures
Architecture-Aware Configuration and Scheduling of Matrix Multiplication on Asymmetric Multicore Processors
Asymmetric multicore processors (AMPs) have recently emerged as an appealing
technology for severely energy-constrained environments, especially in mobile
appliances where heterogeneity in applications is mainstream. In addition,
given the growing interest in low-power high performance computing, this type
of architecture is also being investigated as a means to improve the
throughput-per-Watt of complex scientific applications.
In this paper, we design and embed several architecture-aware optimizations
into a multi-threaded general matrix multiplication (gemm), a key operation of
the BLAS, in order to obtain a high performance implementation for ARM
big.LITTLE AMPs. Our solution is based on the reference implementation of gemm
in the BLIS library, and integrates a cache-aware configuration as well as
asymmetric-static and dynamic scheduling strategies that carefully tune and
distribute the operation's micro-kernels among the big and LITTLE cores of the
target processor. The experimental results on a Samsung Exynos 5422, a
system-on-chip with ARM Cortex-A15 and Cortex-A7 clusters that implements the
big.LITTLE model, show that our cache-aware versions of gemm with asymmetric
scheduling attain significant performance gains over their
architecture-oblivious counterparts while exploiting all the resources of the
AMP to deliver considerable energy efficiency.
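The asymmetric static scheduling mentioned in this abstract can be illustrated with a minimal sketch. It assumes that work is divided once, up front, in proportion to each core's relative speed. The function name `asymmetric_static_split` and the speed ratio of 3.0 are hypothetical choices for illustration; they are not taken from the paper or from the BLIS library.

```python
def asymmetric_static_split(n_iters, n_big, n_little, big_to_little_ratio):
    """Split n_iters loop iterations between big and LITTLE cores.

    Assumption: each big core is `big_to_little_ratio` times as fast as a
    LITTLE core (a hypothetical, empirically chosen value).
    Returns one chunk size per core: big cores first, then LITTLE cores.
    """
    if n_big < 0 or n_little < 0 or n_big + n_little == 0:
        raise ValueError("need at least one core")
    weights = [big_to_little_ratio] * n_big + [1.0] * n_little
    total = sum(weights)
    # Proportional share per core, rounded down to whole iterations.
    chunks = [int(n_iters * w / total) for w in weights]
    # Hand out the iterations lost to rounding, starting with the big cores.
    leftover = n_iters - sum(chunks)
    for i in range(leftover):
        chunks[i % len(chunks)] += 1
    return chunks


# Example: 1000 iterations on 4 big + 4 LITTLE cores, big assumed 3x faster.
chunks = asymmetric_static_split(1000, 4, 4, 3.0)
```

A dynamic strategy would instead let cores fetch chunks from a shared queue at run time, trading scheduling overhead for robustness to a badly chosen ratio.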
Tools for improving performance portability in heterogeneous environments
Programa Oficial de Doutoramento en Investigación en Tecnoloxías da Información. 524V01
[Abstract]
Parallel computing is currently partially dominated by the availability of heterogeneous
devices. These devices differ from each other in aspects such as the
instruction set they execute, the number and the type of computing devices that
they offer or the structure of their memory systems. In recent years, languages,
libraries and extensions have appeared that make it possible to write parallel code once and
run it on a wide variety of devices, OpenCL being the most widespread solution of
this kind. However, functional portability does not imply performance portability.
Thus, one of the problems that remains open in this field is achieving automatic
performance portability, that is, the ability to automatically tune a given code for
any device where it will be executed so that it obtains good performance. This
thesis develops three different solutions to tackle this problem. All three
are based on typical source-to-source optimizations for heterogeneous devices. Both
the set of optimizations to apply and the way they are applied depend on different
optimization parameters, whose values have to be tuned for each specific device.
The first solution is OCLoptimizer, a source-to-source optimizer that can optimize
annotated OpenCL kernels with the help of configuration files that guide the
optimization process. The tool optimizes kernels for a specific device, and it is also
able to automate the generation of functional host codes when only a single kernel
is optimized.
The two remaining solutions are built on top of the Heterogeneous Programming
Library (HPL), a C++ framework that provides an easy and portable way to exploit
heterogeneous computing systems. The first of these solutions uses the run-time
code generation capabilities of HPL to generate a self-optimizing version of a matrix
multiplication that can optimize itself at run time for a specific device. The last
solution is the development of a built-in just-in-time optimizer for HPL that can
optimize, at run time, an HPL code for a specific device. While the first two solutions
use search processes to find the best values for the optimization parameters, this last
alternative relies on heuristics based on general optimization strategies.
[Resumen]
Parallel computing is currently partially dominated by the many heterogeneous
devices available. These devices differ from one another in characteristics such
as the instruction set they execute, the number and type of compute units they
include, or the structure of their memory systems. In recent years, languages,
libraries and extensions have appeared that allow the parallel version of a code
to be written once and run on a wide range of devices, OpenCL being the most
widespread solution among them. However, functional portability does not imply
performance portability. Thus, one of the major problems still open in this field
is the automation of performance portability, that is, the ability to
automatically adapt a given code for execution on any device and obtain good
performance. This thesis addresses this problem by proposing three different
solutions. All three are based on applying source-to-source optimizations
commonly used on heterogeneous devices. Both the set of optimizations to apply
and the way they are applied depend on several optimization parameters, whose
values must be tuned for each specific device.
The first solution proposed is OCLoptimizer, a source-to-source optimizer that,
starting from annotated OpenCL kernels and supporting configuration files,
produces optimized versions of those kernels for a specific device. In addition,
when there is a single kernel to optimize, it automates the generation of a
functional host code for that kernel.
The other two solutions were implemented using the Heterogeneous Programming
Library (HPL), a C++ library that makes it possible to program heterogeneous
systems in an easy and portable way. The first of these solutions exploits the
run-time code generation capabilities of HPL to generate versions of a matrix
product that adapt automatically, at run time, to the characteristics of a
specific device. The last solution consists of developing and incorporating into
HPL an on-the-fly optimizer, so that optimized versions of an HPL code can be
obtained at run time for a given device. While the first two solutions use search
processes to find the best values for the optimization parameters, this last
alternative relies on heuristics defined from general optimization
recommendations.
[Resumo]
Parallel computing is currently partially dominated by the many heterogeneous
devices available. These devices differ from one another in characteristics such
as the instruction set they execute, the number and type of compute units they
include, or the structure of their memory systems. In recent years, languages,
libraries and extensions have appeared that allow the parallel version of a code
to be written once and run on a wide range of devices, OpenCL being the most
widespread solution among them. However, functional portability does not imply
performance portability. Thus, one of the major problems still open in this field
is the automation of performance portability, that is, the ability to
automatically adapt a given code for execution on any device and obtain good
performance. This thesis addresses this problem by proposing three different
solutions. All three are based on applying source-to-source optimizations
commonly used on heterogeneous devices. Both the set of optimizations to apply
and the way they are applied depend on several optimization parameters, whose
values must be set according to the specific device.
The first solution proposed is OCLoptimizer, a source-to-source optimizer that,
starting from annotated OpenCL kernels and supporting configuration files,
produces optimized versions of those kernels for a specific device. In addition,
when there is a single kernel to optimize, it also automates the generation of a
functional host code for that kernel.
The other two solutions were implemented using the Heterogeneous Programming
Library (HPL), a C++ library that makes it possible to program heterogeneous
systems in an easy and portable way. The first of these solutions exploits the
run-time code generation capabilities of HPL to generate versions of a matrix
product that adapt automatically to the characteristics of a specific device.
The last solution consists of developing and incorporating into HPL an optimizer
capable of producing, at run time, optimized versions of an HPL code for a given
device. While the first two solutions use search processes to find the best
values for the optimization parameters, this last alternative relies on
heuristics defined from general optimization recommendations.
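The search-based tuning that this abstract attributes to its first two solutions can be sketched as an exhaustive search over a small parameter grid. This is only an illustration under stated assumptions: the parameter names `tile` and `unroll`, the helper `tune`, and the synthetic cost function are all hypothetical, and none of them reflects the actual interface of OCLoptimizer or HPL. In a real tuner, `measure` would time the generated kernel on the target device.

```python
import itertools


def tune(param_space, measure):
    """Exhaustively search param_space and return (best_config, best_cost).

    param_space: dict mapping a parameter name to its candidate values.
    measure: callable taking a config dict and returning a cost, lower is
             better (for instance the measured run time of a kernel).
    """
    names = sorted(param_space)
    best_config, best_cost = None, float("inf")
    for values in itertools.product(*(param_space[n] for n in names)):
        config = dict(zip(names, values))
        cost = measure(config)
        if cost < best_cost:
            best_config, best_cost = config, cost
    return best_config, best_cost


def synthetic_cost(config):
    # Stand-in for timing a kernel on a real device: a made-up cost that is
    # lowest at tile=16, unroll=4. This is not measured data.
    return abs(config["tile"] - 16) + abs(config["unroll"] - 4)


# Hypothetical optimization parameters and candidate values.
space = {"tile": [4, 8, 16, 32], "unroll": [1, 2, 4, 8]}
best, cost = tune(space, synthetic_cost)
```

Exhaustive search is affordable only for small grids; a heuristic approach, like the one the abstract describes for the just-in-time optimizer, avoids the repeated measurements at the price of possibly missing the best configuration.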
A Survey on Compiler Autotuning using Machine Learning
Since the mid-1990s, researchers have been trying to use machine-learning-based
approaches to solve a number of different compiler optimization problems.
These techniques primarily enhance the quality of the obtained results and,
more importantly, make it feasible to tackle two main compiler optimization
problems: optimization selection (choosing which optimizations to apply) and
phase-ordering (choosing the order of applying optimizations). The compiler
optimization space continues to grow due to the advancement of applications,
increasing number of compiler optimizations, and new target architectures.
Generic optimization passes in compilers cannot fully leverage newly introduced
optimizations and, therefore, cannot keep up with the pace of increasing
options. This survey summarizes and classifies the recent advances in using
machine learning for the compiler optimization field, particularly on the two
major problems of (1) selecting the best optimizations and (2) the
phase-ordering of optimizations. The survey highlights the approaches taken so
far, the obtained results, the fine-grained classification among different
approaches and, finally, the influential papers of the field.
Comment: version 5.0 (updated September 2018). Preprint of the article accepted
at ACM CSUR 2018 (42 pages). This survey will be updated quarterly here (send me
your newly published papers to be added to the subsequent version). History:
received November 2016; revised August 2017; revised February 2018; accepted
March 2018
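To give a feel for why the two problems named in this abstract are hard to search exhaustively, the sketch below counts candidate configurations under simple assumptions. The counting models are illustrative simplifications and are not taken from the survey: optimization selection is modelled as switching each of n optimizations on or off, and phase-ordering either as ordering n distinct passes once each or as a fixed-length sequence in which passes may repeat. All function names are hypothetical.

```python
import math


def selection_space_size(n_opts):
    # Each optimization is independently enabled or disabled.
    return 2 ** n_opts


def ordering_space_size(n_opts):
    # Every optimization applied exactly once, in some order.
    return math.factorial(n_opts)


def ordering_with_repetition_size(n_opts, seq_len):
    # A pass sequence of fixed length where passes may repeat.
    return n_opts ** seq_len


sizes = (
    selection_space_size(10),
    ordering_space_size(10),
    ordering_with_repetition_size(10, 5),
)
```

Even with only ten optimizations, the ordering spaces are orders of magnitude larger than the selection space, which is one reason learned models are attractive for pruning or predicting good candidates.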
Master of Science thesis
The advent of the era of cheap and pervasive many-core and multicore parallel systems has highlighted the disparity in performance achieved by novice and expert developers targeting parallel architectures. This disparity is most noticeable with software for running general purpose computations on graphics processing units (GPGPU programs). Current methods for implementing GPGPU programs require an expert-level understanding of the memory hierarchy and execution model of the hardware to reach peak performance. Even for experts, rewriting a program to exploit these hardware features can be tedious and error prone. Compilers and their ability to make code transformations can assist in the implementation of GPGPU programs, handling many of the target-specific details. This thesis presents CUDA-CHiLL, a source-to-source compiler transformation and code generation framework for the parallelization and optimization of computations expressed in sequential loop nests for running on many-core GPUs. This system uniquely uses a complete scripting language to describe composable compiler transformations that can be written, shared and reused by nonexpert application and library developers. CUDA-CHiLL is built on the polyhedral program transformation and code generation framework CHiLL, which is capable of robust composition of transformations while preserving the correctness of the program at each step. Through its use of powerful abstractions and a scripting interface, CUDA-CHiLL allows a developer to focus on optimization strategies and ignore the error-prone details and low-level constructs of GPGPU programming. The high-level framework can be used inside an orthogonal auto-tuning system that can quickly evaluate the space of possible implementations. Although specific to CUDA at the moment, many of the abstractions would hold for any GPGPU framework, particularly OpenCL.
The contributions of this thesis include a programming language approach to providing transformation abstraction and composition, a unifying framework for general and GPU-specific transformations, and a demonstration of the framework on standard benchmarks that shows it is capable of matching or outperforming hand-tuned GPU kernels.
- …