338 research outputs found

    High-level programming of stencil computations on multi-GPU systems using the SkelCL library

    The implementation of stencil computations on modern, massively parallel systems with GPUs and other accelerators currently relies on manually tuned code written with low-level approaches such as OpenCL and CUDA. This makes the development of stencil applications a complex, time-consuming, and error-prone task. We describe how stencil computations can be programmed in our SkelCL approach, which combines high-level programming abstractions with competitive performance on multi-GPU systems. SkelCL extends the OpenCL standard with three high-level features: 1) pre-implemented parallel patterns (a.k.a. skeletons); 2) container data types for vectors and matrices; 3) an automatic data (re)distribution mechanism. We introduce two new SkelCL skeletons that specifically target stencil computations, MapOverlap and Stencil, and we describe their use for particular application examples, discuss their efficient parallel implementation, and report experimental results on systems with multiple GPUs. Our evaluation of three real-world applications shows that stencil code written with SkelCL is considerably shorter than hand-tuned OpenCL code and performs competitively with it.
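To make the stencil pattern concrete, the following is a minimal Python sketch of what an overlap-style skeleton abstracts. It is not SkelCL's API: the names `map_overlap`, `map_overlap_split`, `mean3`, the `pad` parameter, and the chunking scheme are all assumptions chosen for illustration. The second function mimics, under those assumptions, how a multi-device runtime might hand each device a chunk plus a halo of neighbouring elements.

```python
def map_overlap(data, radius, fn, pad=0.0):
    """Apply fn to the window [i - radius, i + radius] around each element.

    Reads outside the input are served from `pad` (hypothetical padding policy).
    """
    padded = [pad] * radius + list(data) + [pad] * radius
    return [fn(padded[i:i + 2 * radius + 1]) for i in range(len(data))]


def map_overlap_split(data, radius, fn, parts, pad=0.0):
    """Same result, computed chunk by chunk.

    Each chunk is extended by a halo of `radius` elements on both sides,
    which is the data a neighbouring device would have to supply.
    """
    n = len(data)
    padded = [pad] * radius + list(data) + [pad] * radius
    bounds = [n * p // parts for p in range(parts + 1)]
    out = []
    for lo, hi in zip(bounds, bounds[1:]):
        chunk = padded[lo:hi + 2 * radius]  # chunk plus left and right halo
        out.extend(fn(chunk[i:i + 2 * radius + 1]) for i in range(hi - lo))
    return out


def mean3(window):
    """Example stencil function: average of the window."""
    return sum(window) / len(window)
```

The point of the split version is that the user-supplied function never changes; only the runtime decides how much halo data each chunk needs.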

    Many-core compiler fuzzing

    We address the compiler correctness problem for many-core systems through novel applications of fuzz testing to OpenCL compilers. Focusing on two methods from prior work, random differential testing and testing via equivalence modulo inputs (EMI), we present several strategies for random generation of deterministic, communicating OpenCL kernels, and an injection mechanism that allows EMI testing to be applied to kernels that otherwise exhibit little or no dynamically-dead code. We use these methods to conduct a large, controlled testing campaign with respect to 21 OpenCL (device, compiler) configurations, covering a range of CPU, GPU, accelerator, FPGA and emulator implementations. Our study provides independent validation of claims in prior work related to the effectiveness of random differential testing and EMI testing, proposes novel methods for lifting these techniques to the many-core setting, and reveals a significant number of OpenCL compiler bugs in commercial implementations.
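As a rough illustration of random differential testing, the Python sketch below generates small random programs and flags those on which two implementations disagree. It is a toy under stated assumptions, not the paper's tooling: the "programs" are arithmetic expression trees rather than OpenCL kernels, and `gen_program`, `run_reference`, `run_buggy`, and `differential_test` are hypothetical names, with `run_buggy` containing a deliberately injected fault.

```python
import random


def gen_program(rng, depth=3):
    """Random expression tree: an int leaf or a tuple (op, left, right)."""
    if depth == 0 or rng.random() < 0.3:
        return rng.randint(-5, 5)
    return (rng.choice("+-*"), gen_program(rng, depth - 1), gen_program(rng, depth - 1))


def run_reference(p):
    """Straightforward evaluator used as one implementation under test."""
    if isinstance(p, int):
        return p
    op, a, b = p
    x, y = run_reference(a), run_reference(b)
    return {"+": x + y, "-": x - y, "*": x * y}[op]


def run_buggy(p):
    """Hypothetical faulty implementation: wrongly rewrites 0 - y to y."""
    if isinstance(p, int):
        return p
    op, a, b = p
    x, y = run_buggy(a), run_buggy(b)
    if op == "-" and x == 0:
        return y  # injected bug, for illustration only
    return {"+": x + y, "-": x - y, "*": x * y}[op]


def differential_test(programs, implementations):
    """Return the programs on which the implementations do not all agree."""
    return [p for p in programs if len({impl(p) for impl in implementations}) > 1]


rng = random.Random(0)  # fixed seed keeps the campaign reproducible
suspects = differential_test([gen_program(rng) for _ in range(200)],
                             [run_reference, run_buggy])
```

Because the generated programs are deterministic, any disagreement points at an implementation fault rather than at the program itself, which is why deterministic kernel generation matters in this setting.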

    Exploiting Heterogeneous Parallelism With the Heterogeneous Programming Library

    While recognition of the advantages of heterogeneous computing is steadily growing, the issues of programmability and portability hinder its exploitation. The introduction of the OpenCL standard was a major step forward in that it provides code portability, but its interface is even more complex than that of other approaches. In this paper, we present the Heterogeneous Programming Library (HPL), which permits the development of heterogeneous applications addressing both portability and programmability while not sacrificing high performance. This is achieved by means of an embedded language and data types provided by the library with which generic computations to be run in heterogeneous devices can be expressed. A comparison in terms of programmability and performance with OpenCL shows that both approaches offer very similar performance, while outlining the programmability advantages of HPL. This work was funded by the Xunta de Galicia under the project "Consolidación e Estructuración de Unidades de Investigación Competitivas" 2010/06 and the MICINN, cofunded by FEDER funds, under grant TIN2010-16735. Zeki Bozkus is funded by the Scientific and Technological Research Council of Turkey (TUBITAK; 112E191).

    Simulation-Based Sailboat Trajectory Optimization using On-Board Heterogeneous Computers

    A dynamic programming-based algorithm adapted to on-board heterogeneous computers for simulation-based trajectory optimization was studied in the context of high-performance sailing. The algorithm can efficiently utilize all OpenCL-capable devices, starting the computation (if necessary, in single precision) on a GPU and finalizing it (if necessary, in double precision) with the use of a CPU. The serial and parallel versions of the algorithm are presented in detail. Possible extensions of the basic algorithm are also described. The experimental results show that contemporary heterogeneous on-board/mobile computers can be treated as micro HPC platforms. They offer high performance (the OpenCL-capable GPU was found to accelerate the optimization routine 41-fold) while remaining energy and cost efficient. The simulation-based approach has the potential to give very accurate results, as the mathematical model upon which the simulator is based may be as complex as required. The black-box represented performance measure and the use of OpenCL make the presented approach applicable to many trajectory optimization problems.

    Tools for improving performance portability in heterogeneous environments

    Official Doctoral Programme in Information Technology Research. 524V01
    Parallel computing is currently partially dominated by the availability of heterogeneous devices. These devices differ from each other in aspects such as the instruction set they execute, the number and type of computing units they offer, or the structure of their memory systems. In recent years, languages, libraries and extensions have appeared that make it possible to write parallel code once and run it on a wide variety of devices, OpenCL being the most widespread solution of this kind. However, functional portability does not imply performance portability. Thus, one of the problems that remains open in this field is achieving automatic performance portability, that is, the ability to automatically tune a given code for any device on which it will be executed so that it obtains good performance. This thesis develops three different solutions to tackle this problem. All three are based on typical source-to-source optimizations for heterogeneous devices. Both the set of optimizations to apply and the way they are applied depend on optimization parameters whose values have to be tuned for each specific device. The first solution is OCLoptimizer, a source-to-source optimizer that can optimize annotated OpenCL kernels with the help of configuration files that guide the optimization process. The tool optimizes kernels for a specific device, and it can also automate the generation of functional host code when only a single kernel is optimized. The two remaining solutions are built on top of the Heterogeneous Programming Library (HPL), a C++ framework that provides an easy and portable way to exploit heterogeneous computing systems. The first of these solutions uses the run-time code generation capabilities of HPL to generate a self-optimizing version of a matrix multiplication that can optimize itself at run time for a specific device. The last solution is a built-in just-in-time optimizer for HPL that can optimize, at run time, HPL code for a specific device. While the first two solutions use search processes to find the best values for the optimization parameters, this last alternative relies on heuristics based on general optimization strategies.
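The search-based tuning idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not OCLoptimizer or HPL: it tunes a single hypothetical parameter (`tile`, the block size of a tiled matrix product) by exhaustive timing on the machine at hand, and the names `blocked_matmul` and `autotune` are invented for the example.

```python
import time


def blocked_matmul(a, b, tile):
    """Square matrix product computed in tile x tile blocks of rows and inner index."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for i in range(ii, min(ii + tile, n)):
                for k in range(kk, min(kk + tile, n)):
                    aik = a[i][k]
                    for j in range(n):
                        c[i][j] += aik * b[k][j]
    return c


def autotune(run, candidates, repeats=3):
    """Time run(value) for every candidate and return (best value, all timings).

    Exhaustive search; the best of `repeats` runs is kept to reduce noise.
    """
    timings = {}
    for value in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            run(value)
            best = min(best, time.perf_counter() - start)
        timings[value] = best
    return min(timings, key=timings.get), timings


n = 16
a = [[i + j for j in range(n)] for i in range(n)]
b = [[i * j + 1 for j in range(n)] for i in range(n)]
best_tile, timings = autotune(lambda t: blocked_matmul(a, b, t), [1, 2, 4, 8, 16])
```

Every candidate produces the same result matrix; only the running time differs, and the winning value may change from one device to another, which is the gap between functional and performance portability.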

    On the performance of block ciphers based on cellular automata when implemented on graphics processors

    Block ciphers are widely used in information security tasks. The article considers the implementation and performance testing of symmetric block ciphers based on generalized cellular automata, constructed using techniques the author developed earlier, in a software implementation on the GPUs NVIDIA GTX 650, NVIDIA GTX 770 and AMD R9 280X. The implementation used the OpenCL interface. Ciphers with 4 rounds were used; the number of steps of the generalized cellular automaton in each round was chosen from the list 8, 12, 16; the block length was 128 bits. A modified Lubotzky-Phillips-Sarnak graph was used as the graph of the cellular automaton. The performance of the resulting implementation in counter and ECB modes ranged from 90 to 380 Mbit/s, depending on the parameters, which is comparable with the performance of traditional block ciphers such as AES, DES, BLOWFISH, CAST, RC6 and IDEA on a CPU. At the same time, performance in CBC mode ranged from 7 to 29 Mbit/s. Given that cryptographic algorithms based on generalized cellular automata are designed for hardware implementation, the level of performance achieved in ECB and counter modes significantly broadens the application scope of these block ciphers, effectively enabling their use on any computing device that has a GPU, including personal computers, laptops, tablets, smartphones, etc. The work was supported by the Russian Foundation for Basic Research (RFBR) as part of research project No. 16-07-00542 а.
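The gap between the ECB/counter and CBC figures is consistent with a general property of the modes: in ECB and counter mode every block can be encrypted independently, whereas CBC encryption chains each block to the previous ciphertext block and is therefore sequential. The Python sketch below illustrates only that dependency structure. `toy_encrypt_block` is a hypothetical, insecure stand-in operating on 16-bit values, not the cellular-automaton cipher from the article.

```python
from concurrent.futures import ThreadPoolExecutor


def toy_encrypt_block(block, key):
    """Hypothetical 16-bit stand-in for a block cipher. NOT secure."""
    return ((block ^ key) * 5 + 1) & 0xFFFF


def ecb_encrypt(blocks, key, pool=None):
    """Every block is independent, so the map can be run in parallel."""
    mapper = pool.map if pool is not None else map
    return list(mapper(lambda b: toy_encrypt_block(b, key), blocks))


def ctr_crypt(blocks, key, nonce=0):
    """Counter mode: keystream block i depends only on nonce + i.

    The same function encrypts and decrypts.
    """
    return [b ^ toy_encrypt_block((nonce + i) & 0xFFFF, key)
            for i, b in enumerate(blocks)]


def cbc_encrypt(blocks, key, iv=0):
    """Each step needs the previous ciphertext block: inherently sequential."""
    out, prev = [], iv
    for b in blocks:
        prev = toy_encrypt_block(b ^ prev, key)
        out.append(prev)
    return out


blocks, key = [7, 7, 9], 0x1234
with ThreadPoolExecutor() as pool:
    ecb_parallel = ecb_encrypt(blocks, key, pool)
```

On a GPU the independent modes map naturally onto one work-item per block, while the CBC loop leaves little to parallelize within a single message.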

    Supporting efficient overlapping of host-device operations for heterogeneous programming with CtrlEvents

    Heterogeneous systems with several kinds of devices, such as multi-core CPUs, GPUs and FPGAs, among others, are now commonplace. Exploiting all these devices with device-oriented programming models, such as CUDA or OpenCL, requires expertise and knowledge about the underlying hardware to tailor the application to each specific device, thus degrading performance portability. Higher-level proposals simplify the programming of these devices, but their current implementations do not efficiently support problems that include frequent bursts of computation and communication, or input/output operations. In this work we present CtrlEvents, a new heterogeneous runtime solution which automatically overlaps computation and communication whenever possible, simplifying and improving the efficiency of data-dependency analysis and the coordination of both device computations and host tasks that include generic I/O operations. Our solution outperforms other state-of-the-art implementations in most situations, presenting a good balance between portability, programmability and efficiency. Funding: Ministerio de Ciencia e Innovación - FEDER (TIN2017-88614-R); Junta de Castilla y León (VA226P20); Ministerio de Ciencia e Innovación - AEI and European Union NextGenerationEU/PRTR (TED2021–130367B–I00 and MCIN/AEI/10.13039/501100011033).