    Accelerating Scientific Computing Models Using GPU Processing

    GPGPUs offer significant computational power for programmers to leverage, and this power is especially useful for accelerating scientific models. This thesis analyzes the use of GPGPU programming to accelerate scientific computing models. First, the construction of hardware for visualization and computation of scientific models is discussed, with attention to the design factors that affect scientific modeling performance. Image processing is an embarrassingly parallel problem well suited for GPGPU acceleration. An image processing library was developed to show the process of recognizing embarrassingly parallel problems, and it serves as an example of converting a serial CPU implementation to a GPU-accelerated implementation. Genetic algorithms are biologically inspired heuristic search algorithms based on natural selection. The Tetris genetic algorithm with A* pathfinding illustrates memory-bound limitations that can prevent direct algorithm conversions from the CPU to the GPU. An analysis of an existing landscape evolution model, CHILD, for GPU acceleration shows that even when a model shows promise for GPU acceleration, the underlying data structures can have a significant impact on the ability to move to a GPU implementation. CHILD also offers an example of creating tighter MATLAB integration between existing models. Lastly, a parallel spatial sorting algorithm is discussed as a possible replacement for current spatial sorting algorithms implemented in models such as smoothed particle hydrodynamics.

    Genetic improvement of GPU software

    We survey genetic improvement (GI) of general purpose computing on graphics cards. We summarise several experiments which demonstrate four themes. Experiments with the gzip program show that genetic programming can automatically port sequential C code to parallel code. Experiments with the StereoCamera program show that GI can upgrade legacy parallel code for new hardware and software. Experiments with NiftyReg and BarraCUDA show that GI can make substantial improvements to current parallel CUDA applications. Finally, experiments with the pknotsRG program show that with semi-automated approaches, enormous speed-ups can sometimes be had by growing and grafting new code with genetic programming in combination with human input.

    Tools for improving performance portability in heterogeneous environments

    Programa Oficial de Doutoramento en Investigación en Tecnoloxías da Información. 524V01. [Abstract] Parallel computing is currently partially dominated by the availability of heterogeneous devices. These devices differ from each other in aspects such as the instruction set they execute, the number and type of computing devices they offer, or the structure of their memory systems. In recent years, languages, libraries and extensions have appeared that allow a parallel code to be written once and run on a wide variety of devices, OpenCL being the most widespread solution of this kind. However, functional portability does not imply performance portability. One of the problems that remains open in this field is therefore achieving automatic performance portability, that is, the ability to automatically tune a given code for any device on which it will be executed so that it obtains good performance. This thesis develops three different solutions to tackle this problem. All three are based on typical source-to-source optimizations for heterogeneous devices. Both the set of optimizations to apply and the way they are applied depend on optimization parameters whose values have to be tuned for each specific device. The first solution is OCLoptimizer, a source-to-source optimizer that can optimize annotated OpenCL kernels with the help of configuration files that guide the optimization process. The tool optimizes kernels for a specific device, and it can also automate the generation of functional host code when only a single kernel is optimized. The two remaining solutions are built on top of the Heterogeneous Programming Library (HPL), a C++ framework that provides an easy and portable way to exploit heterogeneous computing systems. The first of these uses the run-time code generation capabilities of HPL to generate a self-optimizing version of a matrix multiplication that can optimize itself at run time for a specific device. The last solution is a built-in just-in-time optimizer for HPL that can optimize an HPL code at run time for a specific device. While the first two solutions use search processes to find the best values for the optimization parameters, this last alternative relies on heuristics based on general optimization strategies.

    Pipelined genetic propagation

    © 2015 IEEE. Genetic Algorithms (GAs) are a class of numerical and combinatorial optimisers which are especially useful for solving complex non-linear and non-convex problems. However, the required execution time often limits their application to small-scale or latency-insensitive problems, so techniques to increase the computational efficiency of GAs are needed. FPGA-based acceleration has significant potential for speeding up genetic algorithms, but existing FPGA GAs are limited by the generational approaches inherited from software GAs. Many parts of the generational approach do not map well to hardware, such as the large shared population memory and intrinsic loop-carried dependency. To address this problem, this paper proposes a new hardware-oriented approach to GAs, called Pipelined Genetic Propagation (PGP), which is intrinsically distributed and pipelined. PGP represents a GA solver as a graph of loosely coupled genetic operators, which allows the solution to be scaled to the available resources, and also to dynamically change topology at run-time to explore different solution strategies. Experiments show that pipelined genetic propagation is effective in solving seven different applications. Our PGP design is 5 times faster than a recent FPGA-based GA system, and 90 times faster than a CPU-based GA system.
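
    The loop-carried dependency of the generational approach mentioned above can be seen in a minimal software sketch. This is a generic generational GA on a toy bit-counting problem, not the PGP design or any system from the paper; the function names, the fitness function, and all parameter values are assumptions chosen for illustration.

```python
import random


def fitness(bits):
    """Toy objective (an assumption): count the 1-bits in the individual."""
    return sum(bits)


def generational_ga(n_bits=16, pop_size=20, generations=30, seed=0):
    """Run a plain generational GA and return the best fitness seen per generation."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    best_per_generation = []

    for _ in range(generations):
        # The whole shared population is read before any child is produced.
        ranked = sorted(population, key=fitness, reverse=True)
        best_per_generation.append(fitness(ranked[0]))

        # Elitism: carry the best individual over unchanged.
        next_population = [ranked[0][:]]
        while len(next_population) < pop_size:
            # Binary tournament selection of two parents.
            parent_a = max(rng.sample(population, 2), key=fitness)
            parent_b = max(rng.sample(population, 2), key=fitness)
            # One-point crossover followed by an occasional single-bit mutation.
            cut = rng.randrange(1, n_bits)
            child = parent_a[:cut] + parent_b[cut:]
            if rng.random() < 0.1:
                child[rng.randrange(n_bits)] ^= 1
            next_population.append(child)

        # Loop-carried dependency: generation g+1 cannot start until
        # generation g has been fully built and stored.
        population = next_population

    return best_per_generation


if __name__ == "__main__":
    print(generational_ga())
```

    Each iteration both reads the entire population and replaces it, which is the shared-memory, sequential structure the abstract describes as a poor fit for hardware pipelines.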

    A Survey on Compiler Autotuning using Machine Learning

    Since the mid-1990s, researchers have been trying to use machine-learning-based approaches to solve a number of different compiler optimization problems. These techniques primarily enhance the quality of the obtained results and, more importantly, make it feasible to tackle two main compiler optimization problems: optimization selection (choosing which optimizations to apply) and phase-ordering (choosing the order of applying optimizations). The compiler optimization space continues to grow due to the advancement of applications, the increasing number of compiler optimizations, and new target architectures. Generic optimization passes in compilers cannot fully leverage newly introduced optimizations and, therefore, cannot keep up with the pace of increasing options. This survey summarizes and classifies the recent advances in using machine learning for the compiler optimization field, particularly on the two major problems of (1) selecting the best optimizations and (2) the phase-ordering of optimizations. The survey highlights the approaches taken so far, the obtained results, the fine-grained classification among different approaches and, finally, the influential papers of the field. Comment: version 5.0 (updated September 2018); preprint of the article accepted at ACM CSUR 2018 (42 pages). History: received November 2016; revised August 2017; revised February 2018; accepted March 2018.
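
    The difference in scale between the two problems can be illustrated with a small counting sketch. It assumes that optimization selection means choosing an on/off subset of n passes, and that phase-ordering means choosing an ordered sequence of at most max_len passes with repetition allowed; both definitions, the function names, and the pass names are simplifying assumptions for illustration, not the survey's formal model.

```python
from itertools import product


def selection_space_size(n_passes):
    """Number of on/off subsets of n_passes optimizations."""
    return 2 ** n_passes


def phase_ordering_space_size(n_passes, max_len):
    """Number of ordered pass sequences of length 0..max_len, repetition allowed."""
    return sum(n_passes ** length for length in range(max_len + 1))


def enumerate_orderings(passes, max_len):
    """Explicitly list every sequence counted by phase_ordering_space_size."""
    sequences = []
    for length in range(max_len + 1):
        sequences.extend(product(passes, repeat=length))
    return sequences


if __name__ == "__main__":
    # Hypothetical pass names, used only to make the enumeration readable.
    passes = ["inline", "unroll", "vectorize"]
    print(selection_space_size(len(passes)))
    print(phase_ordering_space_size(len(passes), 2))
    print(enumerate_orderings(passes, 2))
```

    Under these assumptions the selection space grows as 2^n while the ordering space grows as n^L in the sequence length L, which is one way to see why exhaustive search becomes impractical and learned models are used to guide it.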

    METADOCK: A parallel metaheuristic schema for virtual screening methods

    Virtual screening through molecular docking can be translated into an optimization problem, which can be tackled with metaheuristic methods. The interaction between two chemical compounds (typically a protein, enzyme or receptor, and a small molecule, or ligand) is calculated by using highly computationally demanding scoring functions that are computed at several binding spots located throughout the protein surface. This paper introduces METADOCK, a novel molecular docking methodology based on parameterized and parallel metaheuristics and designed to leverage computers based on heterogeneous architectures. The application decides the optimization technique at run time by setting a configuration schema. Our proposed solution finds a good workload balance via dynamic assignment of jobs to heterogeneous resources, which perform independent metaheuristic executions when computing the different molecular interactions required by the scoring functions in use. A cooperative scheduling of jobs optimizes the quality of the solution and the overall performance of the simulation, thereby opening a new path for further developments of virtual screening methods on high-performance contemporary heterogeneous platforms.
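
    Dynamic assignment of jobs to heterogeneous resources can be sketched as a simple simulation in which each worker takes the next job as soon as it becomes free. This is not METADOCK's scheduler; the function names, the relative worker speeds, and the job costs are assumptions made up for illustration.

```python
import heapq


def dynamic_schedule(job_costs, worker_speeds):
    """Simulate greedy dynamic assignment.

    Each job goes to whichever worker becomes free first. Returns the
    makespan and the number of jobs each worker ended up processing.
    """
    # Heap of (time at which the worker is free, worker index).
    free_at = [(0.0, w) for w in range(len(worker_speeds))]
    heapq.heapify(free_at)
    jobs_per_worker = [0] * len(worker_speeds)
    makespan = 0.0

    for cost in job_costs:
        start, worker = heapq.heappop(free_at)
        finish = start + cost / worker_speeds[worker]
        jobs_per_worker[worker] += 1
        makespan = max(makespan, finish)
        heapq.heappush(free_at, (finish, worker))

    return makespan, jobs_per_worker


def static_schedule(job_costs, worker_speeds):
    """Baseline: deal jobs out round-robin regardless of worker speed."""
    busy = [0.0] * len(worker_speeds)
    for i, cost in enumerate(job_costs):
        worker = i % len(worker_speeds)
        busy[worker] += cost / worker_speeds[worker]
    return max(busy)


if __name__ == "__main__":
    jobs = [1.0] * 12      # twelve equal jobs (hypothetical)
    speeds = [1.0, 2.0]    # second worker is twice as fast (hypothetical)
    print(dynamic_schedule(jobs, speeds))
    print(static_schedule(jobs, speeds))
```

    In this toy setting the faster worker naturally picks up more jobs under dynamic assignment, so the slow worker no longer determines the total run time as it does with an even static split.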