    A multi-GPU shallow-water simulation with transport of contaminants

    [Abstract] This work presents cost-effective multi-graphics processing unit (GPU) parallel implementations of a finite-volume numerical scheme for solving pollutant transport problems in two-dimensional domains. The fluid is modeled by the 2D shallow-water equations, whereas the transport of the pollutant is modeled by a transport equation. The 2D domain is discretized using a first-order Roe finite-volume scheme. Specifically, this paper presents multi-GPU implementations of both a solution that exploits recomputation on the GPU and an optimized solution based on a ghost cell decoupling approach. Our multi-GPU implementations have been optimized using nonblocking communications, overlapping of communications and computations, and ghost cell expansion to minimize communications. The fastest one reached a speedup of 78× using four GPUs on an InfiniBand network with respect to a parallel execution on a multicore CPU with six cores and two-way hyperthreading per core. Such performance, measured using a realistic problem, enabled the calculation of solutions not only in real time but also orders of magnitude faster than the simulated time. Copyright © 2012 John Wiley & Sons, Ltd.
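    The ghost cell idea mentioned in this abstract can be illustrated with a minimal sketch. It is not the paper's Roe scheme or its multi-GPU code: as an assumption for illustration, it uses a scalar 1D first-order upwind update with a periodic boundary, and the function names (`upwind_step`, `run_single`, `run_split`) are hypothetical. It only shows that two subdomains that exchange one ghost cell per step reproduce the single-domain result.

```python
def upwind_step(u, left_ghost, c):
    """One first-order upwind update for u_t + a*u_x = 0 with a > 0.

    c is the Courant number a*dt/dx; left_ghost is the value of the cell
    immediately to the left of u[0] (the ghost cell).
    """
    padded = [left_ghost] + list(u)
    return [padded[i + 1] - c * (padded[i + 1] - padded[i]) for i in range(len(u))]


def run_single(u0, steps, c):
    """Reference run: the whole domain in one piece, periodic boundary."""
    u = list(u0)
    for _ in range(steps):
        u = upwind_step(u, u[-1], c)
    return u


def run_split(u0, steps, c):
    """Same computation on two subdomains exchanging one ghost cell per step."""
    half = len(u0) // 2
    a, b = list(u0[:half]), list(u0[half:])
    for _ in range(steps):
        # "Communication": each subdomain receives its left neighbour's last cell.
        ghost_a, ghost_b = b[-1], a[-1]
        a = upwind_step(a, ghost_a, c)
        b = upwind_step(b, ghost_b, c)
    return a + b


if __name__ == "__main__":
    u0 = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
    print(run_split(u0, 10, 0.5) == run_single(u0, 10, 0.5))
```

    In a real multi-device code the ghost exchange would be a nonblocking transfer overlapped with the update of interior cells; here it is a plain assignment.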

    Numerical Simulation of Pollutant Transport in a Shallow-Water System on the Cell Heterogeneous Processor

    [Abstract] This paper presents an implementation, optimized for the Cell processor, of a finite-volume numerical scheme for 2D shallow-water systems with pollutant transport. A description of the special architecture and programming required by the Cell processor motivates the methodology to develop optimized implementations for this platform. This process involves parallelization, data structure reorganization, explicit transfers of data and computation vectorization. Our implementation, tested using a realistic problem, achieves very good speedups with respect to the sequential execution on a standard CPU. This work was partially supported by the Science and Innovation Ministry of Spain (Projects TIN2010-16735, MTM2010-21135-C02-01 and MTM2009-11923), Xunta de Galicia CN2012/211 (partially supported by FEDER funds), and the FPU program of the Spanish Government (ref. AP2009-4752).

    Developing adaptive multi-device applications with the Heterogeneous Programming Library

    [Abstract] The usage of heterogeneous devices presents two main problems. One is their complex programming, a problem that grows when multiple devices are used. The second issue is that even if the codes for these devices can be portable on top of OpenCL, they lack performance portability, effectively requiring specialized implementations for each device to get good performance. In this paper we extend the Heterogeneous Programming Library (HPL), which improves the usability of heterogeneous systems on top of OpenCL, to better handle both issues. First, we provide HPL with mechanisms to support the implementation of any multi-device application that requires arbitrary patterns of communication between several devices and a host memory. In a second stage, HPL is improved with an adaptive scheme to optimize communications between devices depending on the execution environment. An evaluation using benchmarks of very different nature shows that HPL reduces the SLOCs and programming effort of OpenCL applications by 27% and 43%, respectively, while improving the performance of applications that exchange data between devices by 28% on average. Funding: Xunta de Galicia; GRC2013/055. Ministerio de Economía y Competitividad; TIN2013-42148-P. Consejo de Investigación Científica y Tecnológica de Turquía (TUBITAK); 112E191. European Cooperation in Science and Technology (COST); IC130


    Exsolution and re-dissolution of CO2 gas within heterogeneous porous media are investigated using experimental data and mathematical modeling. In a set of bench-scale experiments, water saturated with CO2 under a given pressure is injected into a 2-D water-saturated porous media system, causing CO2 gas to exsolve and migrate upwards. A layer of fine sand mimicking a heterogeneity within a shallow aquifer is present in the tank to study accumulation and trapping of exsolved CO2. Then, clean water is injected into the system and the accumulated CO2 dissolves back into the flowing water. Simulated exsolution and dissolution mass transfer processes are studied using both near-equilibrium and kinetic approaches and compared to experimental data under conditions that do and do not include lateral background water flow. The mathematical model is based on the mixed hybrid finite element method, which allows for accurate simulation of both advection- and diffusion-dominated processes.

    A robust high-resolution hydrodynamic numerical model for surface water flow and transport processes within a flexible software framework

    Parallel title (German): Ein robustes hochauflösendes hydrodynamisch-numerisches Modell für Oberflächenabfluss- und Transportprozesse innerhalb eines flexiblen Software-Framework

    Towards efficient exploitation of GPUs : a methodology for mapping index-digit algorithms

    [Abstract] GPU computing represented a major step forward, bringing high-performance computing to commodity hardware. Feature-rich parallel languages like CUDA and OpenCL reduced the programming complexity. However, to take full advantage of the computing power of GPUs, specialized parallel algorithms are required. Moreover, the complex GPU memory hierarchy and highly threaded architecture make programming a difficult task even for experienced programmers. Due to the novelty of GPU programming, common general-purpose libraries are scarce and parallel versions of the algorithms are not always readily available. Instead of focusing on the parallelization of particular algorithms, in this thesis we propose a general methodology applicable to most divide-and-conquer problems with a butterfly structure that can be formulated through the Index-Digit representation. First, we analyze the different performance factors of the GPU architecture. Next, we study several optimization techniques and design a series of modular and reusable building blocks, which are used to create the different algorithms. Finally, we study the optimal resource balance and, through a mapping vector representation and operator algebra, we tune the algorithms for the desired configurations. Despite the focus on programmability and flexibility, the resulting implementations offer very competitive performance, surpassing other well-known state-of-the-art libraries.
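    As a rough illustration of what a butterfly structure over index digits looks like, the sketch below computes an unnormalised Walsh-Hadamard transform in log2(n) stages, where stage s combines the pairs of indices whose binary representations differ only in digit s. This is an assumed, sequential example chosen for brevity; it is not the thesis's methodology, mapping vectors, or GPU building blocks, and the function names are hypothetical.

```python
def butterfly_pairs(n_bits, stage):
    """Index pairs (i, j) that differ only in binary digit `stage`."""
    stride = 1 << stage
    return [(i, i | stride) for i in range(1 << n_bits) if not i & stride]


def walsh_hadamard(values):
    """Unnormalised Walsh-Hadamard transform via log2(n) butterfly stages."""
    n = len(values)
    n_bits = n.bit_length() - 1
    if n == 0 or 1 << n_bits != n:
        raise ValueError("length must be a power of two")
    data = list(values)
    for stage in range(n_bits):
        # Every pair in a stage is independent, which is what makes each
        # stage a natural candidate for one-thread-per-pair parallelism.
        for i, j in butterfly_pairs(n_bits, stage):
            data[i], data[j] = data[i] + data[j], data[i] - data[j]
    return data


if __name__ == "__main__":
    print(butterfly_pairs(2, 1))
    print(walsh_hadamard([1, 0, 0, 0]))
```

    Other butterfly algorithms such as the FFT follow the same pairing pattern, with a different per-pair operation and digit ordering.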

    A Portable and Adaptable Fault Tolerance Solution for Heterogeneous Applications

    [Abstract] Heterogeneous systems have become more popular in recent years due to the high performance and reduced energy consumption provided by devices such as GPUs or Xeon Phi accelerators. This paper proposes a checkpoint-based fault tolerance solution for heterogeneous applications, allowing them to survive fail-stop failures in the host CPU or in any of the accelerators used. In addition, applications can be restarted on a different host CPU and/or accelerator device architecture, adapting the computation to the number of devices available during recovery. The proposed solution is built by combining CPPC (ComPiler for Portable Checkpointing), an application-level checkpointing tool, and HPL (Heterogeneous Programming Library), a library that facilitates the development of OpenCL-based applications. Experimental results show the low overhead introduced by the proposal and demonstrate its portability and adaptability benefits. This research was supported by the Ministry of Economy and Competitiveness of Spain and FEDER funds of the EU (Projects TIN2013-42148-P, TIN2016-75845-P and the predoctoral Grant of Nuria Losada Ref. BES-2014-068066), by the EU under the COST Program Action IC1305, Network for Sustainable Ultrascale Computing (NESUS), and by the Galician Government (Xunta de Galicia) and FEDER funds of the EU under the Consolidation Program of Competitive Research (Ref. GRC2013/055).
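    The general pattern of application-level checkpointing with restart after a fail-stop failure can be sketched as follows. This is a generic illustration under stated assumptions: it does not use CPPC or HPL, the state is a toy dictionary saved as JSON, and the function name `run_with_checkpoints` and its `fail_at` parameter are hypothetical.

```python
import json
import os


def run_with_checkpoints(total_steps, path, fail_at=None):
    """Accumulate 0 + 1 + ... + (total_steps - 1), checkpointing every step.

    If a checkpoint file exists at `path`, execution resumes from it.
    `fail_at` simulates a fail-stop failure just before that step runs.
    """
    state = {"step": 0, "acc": 0}
    if os.path.exists(path):
        with open(path) as f:
            state = json.load(f)
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated fail-stop failure")
        state["acc"] += state["step"]
        state["step"] += 1
        # Write to a temporary file and rename it, so a crash during the
        # write never leaves a partially written checkpoint behind.
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, path)
    return state["acc"]
```

    Because the checkpoint stores only portable application state rather than a device-specific memory image, a restarted run is free to use a different host or a different number of devices, which is the property the abstract emphasizes.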