23 research outputs found

    An Evaluation of Emerging Many-Core Parallel Programming Models

    Taking advantage of hybrid systems for sparse direct solvers via task-based runtimes

    The ongoing hardware evolution exhibits an escalation in the number, as well as in the heterogeneity, of computing resources. The pressure to maintain reasonable levels of performance and portability forces application developers to leave traditional programming paradigms and explore alternative solutions. PaStiX is a parallel sparse direct solver based on a dynamic scheduler for modern hierarchical manycore architectures. In this paper, we study the benefits and limits of replacing the highly specialized internal scheduler of the PaStiX solver with two generic runtime systems: PaRSEC and StarPU. The task graph of the factorization step is made available to the two runtimes, giving them the opportunity to process and optimize its traversal in order to maximize the algorithm's efficiency on the targeted hardware platform. A comparative study of the performance of the PaStiX solver on top of its native internal scheduler and the PaRSEC and StarPU frameworks, in different execution environments, is performed. The analysis highlights that these generic task-based runtimes achieve results comparable to the application-optimized embedded scheduler on homogeneous platforms. Furthermore, they are able to significantly speed up the solver in heterogeneous environments by taking advantage of the accelerators while hiding the complexity of their efficient manipulation from the programmer. Comment: Heterogeneity in Computing Workshop (2014).
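
    As a purely conceptual aside (this is not PaStiX, PaRSEC, or StarPU code), the idea of handing a task graph to a dynamic runtime can be sketched in a few lines of Python: each hypothetical factorization task declares its prerequisites, and a thread pool executes any task whose prerequisites have completed.

        # Conceptual sketch only: a hand-rolled dynamic scheduler for a tiny,
        # hypothetical factorization task graph (task name -> prerequisites, work).
        from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

        dag = {
            "factor_panel_0":  (set(),                lambda: "L00"),
            "update_block_01": ({"factor_panel_0"},   lambda: "A01"),
            "update_block_02": ({"factor_panel_0"},   lambda: "A02"),
            "factor_panel_1":  ({"update_block_01"},  lambda: "L11"),
        }

        done, running = set(), {}
        with ThreadPoolExecutor(max_workers=4) as pool:
            while len(done) < len(dag):
                # Submit every task whose prerequisites are all satisfied.
                for name, (deps, work) in dag.items():
                    if name not in done and name not in running and deps <= done:
                        running[name] = pool.submit(work)
                # Wait for at least one running task to finish, then retire it.
                finished, _ = wait(running.values(), return_when=FIRST_COMPLETED)
                for name in [n for n, fut in running.items() if fut in finished]:
                    done.add(name)
                    del running[name]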

    Productivity, performance, and portability for computational fluid dynamics applications

    Hardware trends over the last decade show increasing complexity and heterogeneity in high performance computing architectures, which presents developers of CFD applications with three key challenges: the need to achieve good performance, to remain portable so that current and future hardware can be utilised, and to do so in a productive manner. These three appear to contradict each other when using traditional programming approaches, but in recent years several strategies, such as template libraries and Domain Specific Languages, have emerged as a potential solution: by giving up generality and focusing on a narrower domain of problems, all three can be achieved. This paper gives an overview of the state of the art in delivering performance, portability, and productivity to CFD applications, ranging from high-level libraries that allow the symbolic description of PDEs to low-level techniques that target individual algorithmic patterns. We discuss the advantages and challenges of each approach, and review the performance benchmarking literature that compares implementations across hardware architectures and their programming methods, giving an overview of key applications and their comparative performance.

    Accelerating advanced preconditioning methods on hybrid architectures

    A large number of problems, in diverse areas of science and engineering, involve the solution of large-scale sparse systems of linear equations. In many of these scenarios they are also a computational bottleneck, and for that reason their efficient implementation has motivated an enormous amount of scientific work. For many years, direct methods based on Gaussian elimination have been the reference tool for solving such systems, but the size of the problems addressed today poses serious challenges to most of these algorithms in terms of memory requirements, compute time, and implementation complexity. Driven by advances in preconditioning techniques, iterative methods have become more reliable and therefore emerge as alternatives to direct methods, offering high-quality solutions at a lower computational cost. However, these advances are often specific to a particular problem, or make the preconditioners so complex that their application to diverse problems becomes impractical in terms of runtime and memory consumption. In response to this situation, High Performance Computing strategies are commonly employed, since the sustained development of hardware platforms enables the simultaneous execution of ever more operations. A clear example of this evolution are platforms composed of multi-core processors and hardware accelerators such as Graphics Processing Units (GPUs). In particular, GPUs have become powerful parallel processors, able to integrate thousands of cores at reasonable prices and energy consumption. For these reasons, GPUs are now a hardware platform of great importance for science and engineering, and their efficient use is crucial to achieving good performance in most applications. This thesis focuses on the use of GPUs to accelerate the solution of sparse systems of linear equations using iterative methods preconditioned with modern techniques. In particular, we work on ILUPACK, which offers implementations of the most important iterative methods and features an interesting, modern multilevel ILU preconditioner. In this work, we develop versions of the preconditioner and of the methods included in the package that are able to exploit data parallelism through the use of GPUs without affecting the numerical properties of the preconditioner. In addition, we enable and analyse the use of GPUs in existing parallel versions based on task parallelism for shared- and distributed-memory platforms. The results obtained show a notable improvement in the runtime of the methods addressed, as well as the possibility of solving large-scale problems efficiently.
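
    As a minimal illustration of the underlying technique (not ILUPACK itself, and using a single-level rather than multilevel incomplete factorization), a preconditioned iterative solve can be sketched with SciPy; the matrix, drop tolerance, and fill parameters below are purely illustrative.

        import numpy as np
        import scipy.sparse as sp
        import scipy.sparse.linalg as spla

        n = 1000
        # Sparse, diagonally dominant tridiagonal test matrix (illustrative only).
        A = sp.diags([-1.0, 4.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
        b = np.ones(n)

        ilu = spla.spilu(A, drop_tol=1e-4, fill_factor=10)   # incomplete LU factorization
        M = spla.LinearOperator(A.shape, matvec=ilu.solve)   # apply the preconditioner via the ILU factors
        x, info = spla.gmres(A, b, M=M)                      # preconditioned GMRES
        print("converged" if info == 0 else f"GMRES stopped with flag {info}")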

    Energy-Aware High Performance Computing

    High performance computing centres consume substantial amounts of energy to power large-scale supercomputers and the necessary building and cooling infrastructure. Recently, considerable performance gains resulted predominantly from developments in multi-core, many-core and accelerator technology. Computing centres rapidly adopted this hardware to serve the increasing demand for computational power. However, further performance increases in large-scale computing systems are limited by the aggregate energy budget required to operate them. Power consumption has become a major cost factor for computing centres. Furthermore, energy consumption results in carbon dioxide emissions, a hazard for the environment and public health; and heat, which reduces the reliability and lifetime of hardware components. Energy efficiency is therefore crucial in high performance computing

    Assessing the suitability of Exascale machines for algorithms implementing the Reverse Time Migration method

    As we are expecting Exascale systems for the 2018-2020 time frame, performance analysis and characterization of applications for new processor architectures and large-scale systems are important tasks that make it possible to anticipate the changes required to efficiently exploit future HPC systems. This thesis focuses on seismic imaging applications used for modeling complex physical phenomena, in particular the depth imaging application called Reverse Time Migration (RTM). My first contribution consists in characterizing and modeling the performance of the computational core of RTM, which is based on finite-difference time-domain (FDTD) computations. I identify and explore the major tuning parameters influencing performance, including caches, memory bandwidth, and prefetching, and the interaction between the architecture and the application, leading to a performance model that predicts the DRAM traffic of the FDTD kernel. The second contribution is an analysis identifying the challenges of a hybrid and heterogeneous implementation of FDTD for manycore architectures, comparing a "native" execution with a hybrid, heterogeneous "symmetric" execution. We target Intel's first Xeon Phi co-processor, the Knights Corner. This architecture is an interesting proxy for our study since it contains some of the expected features of an Exascale system: concurrency and heterogeneity. My third contribution is an extension of the performance analysis and modeling to the full RTM, adding communication and I/O to the computational part. RTM is a data-intensive application and requires the storage of intermediate values of the computational field, resulting in expensive I/O accesses. My fourth contribution is the final measurement and model validation of my hybrid RTM implementation on a large system. This has been done on Stampede, a machine of the Texas Advanced Computing Center (TACC), which allows us to test the scalability up to 64 nodes, each containing one 61-core Xeon Phi and two 8-core Sandy Bridge CPUs, for a total of close to 5,000 heterogeneous cores.
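
    For context, the FDTD kernel at the heart of RTM can be sketched in NumPy as a second-order leapfrog update of the 2D acoustic wave equation; the grid sizes, velocity model, and source below are illustrative and are not the thesis code.

        import numpy as np

        nx, nz, nt = 200, 200, 300
        dx = dz = 10.0                   # grid spacing (m)
        dt = 0.001                       # time step (s), chosen to satisfy the CFL condition
        c = np.full((nx, nz), 2000.0)    # constant velocity model (m/s), illustrative

        u_prev = np.zeros((nx, nz))
        u_curr = np.zeros((nx, nz))
        u_curr[nx // 2, nz // 2] = 1.0   # point disturbance at the centre of the grid

        for _ in range(nt):
            # Second-order finite-difference Laplacian on the interior points.
            lap = np.zeros_like(u_curr)
            lap[1:-1, 1:-1] = (
                (u_curr[2:, 1:-1] - 2 * u_curr[1:-1, 1:-1] + u_curr[:-2, 1:-1]) / dx**2
                + (u_curr[1:-1, 2:] - 2 * u_curr[1:-1, 1:-1] + u_curr[1:-1, :-2]) / dz**2
            )
            # Leapfrog update: u(t+dt) from u(t) and u(t-dt).
            u_next = 2 * u_curr - u_prev + (c * dt) ** 2 * lap
            u_prev, u_curr = u_curr, u_next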

    Modeling for inversion in exploration geophysics

    Seismic inversion, and more generally geophysical exploration, aims at better understanding the earth's subsurface, which is one of today's most important challenges. Firstly, it contains natural resources that are critical to our technologies, such as water, minerals, and oil and gas. Secondly, monitoring the subsurface in the context of CO2 sequestration, earthquake detection, and global seismology is of major interest with regard to safety and environmental hazards. However, the technologies to monitor the subsurface or find resources are scientifically extremely challenging. Seismic inversion can be formulated as a mathematical optimization problem that minimizes the difference between field-recorded data and numerically modeled synthetic data. Solving this optimization problem requires modeling wave propagation, thousands of times, in large three-dimensional representations of part of the earth's subsurface. The mathematical and computational complexity of this problem therefore calls for software design that abstracts these requirements and facilitates algorithm and software development. My thesis addresses some of the challenges that arise from these problems, mainly the computational cost and access to the right software for research and development. In the first part, I discuss a performance metric that improves on the current runtime-only benchmarks in exploration geophysics. This metric, the roofline model, first provides insight, at the hardware level, into the performance of a given implementation relative to the maximum achievable performance. Second, this study demonstrates that the choice of numerical discretization has a major impact on the achievable performance depending on the hardware at hand, and shows that a framework that is flexible with respect to the discretization parameters is necessary. In the second part, I introduce and describe Devito, a symbolic finite-difference DSL that provides a high-level interface for the definition of partial differential equations (PDEs) such as the wave equation. From the symbolic definition of a PDE, Devito generates and compiles highly optimized C code on the fly to compute the solution of the PDE. The combination of high-level abstractions and a just-in-time compiler enables research in geophysical exploration and PDE-constrained optimization based on the paradigm of separation of concerns. This allows researchers to concentrate on their respective fields of study while having access to computationally performant solvers with a flexible and easy-to-use interface for implementing complex representations of the physics. The second part of the thesis is split into two sub-parts: the first describes the symbolic application programming interface (API), and the second describes and benchmarks the just-in-time compiler. I end the thesis with concluding remarks, the latest developments, and a brief description of projects that were enabled by Devito.
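
    As a brief illustration of that interface, the following sketch defines and runs a constant-velocity 2D wave propagation using Devito's documented building blocks (Grid, TimeFunction, Eq, solve, Operator); the domain size, velocity, and time-stepping values are illustrative choices, not taken from the thesis.

        from devito import Grid, TimeFunction, Eq, Operator, solve

        grid = Grid(shape=(101, 101), extent=(1000.0, 1000.0))   # 1 km x 1 km domain, 10 m spacing
        u = TimeFunction(name="u", grid=grid, time_order=2, space_order=4)
        u.data[0, 50, 50] = 1.0                                   # simple initial disturbance

        c = 1.5                                                   # constant velocity (km/s), illustrative
        pde = u.dt2 - c**2 * u.laplace                            # acoustic wave equation residual
        update = Eq(u.forward, solve(pde, u.forward))             # explicit update for the next time step

        op = Operator([update])        # symbolic PDE -> optimized C, JIT-compiled behind the scenes
        op.apply(time=200, dt=0.4)     # run 200 time steps of 0.4 ms each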

    Acceleration of a solar energy forecasting tool using massively parallel architectures

    Over the last decade, Uruguay has strongly incorporated wind and solar energy into its energy matrix. The inclusion of this type of energy source to supply the power grid poses a significant challenge for managing its use, mainly because of its fluctuating output. Considering this situation, and in order to simplify the power dispatch task (which efficiently manages the energy resources in the matrix), the Facultad de Ingeniería has developed a tool capable of predicting the generation of photovoltaic solar energy in the country over a 96-hour horizon. One of the main drawbacks of this tool is its high computational cost, which results in restrictive runtimes. This thesis studies the aforementioned tool, focusing especially on its most resource- and time-consuming component, the numerical weather prediction model Weather Research and Forecasting (WRF). In the first stage of this work, the runtime of the model is analysed, concluding that one of the most expensive steps is the solar radiation computation, owing, among other factors, to the numerical precision required in these calculations. Building on this, the thesis proposes a new asynchronous software architecture that decouples the solar radiation computation from the remaining atmospheric properties in WRF and computes them in parallel, following a pipeline parallelism pattern. Additionally, a portion of the radiation calculation is offloaded to a massively parallel co-processor, specifically a GPU (Graphics Processing Unit) and/or a Xeon Phi, in order to decrease the computational load on the CPU. Experimental evaluation of this proposal in a scenario of twelve photovoltaic plants across Uruguayan territory shows that the asynchronous architecture reduces the runtime of the original model by approximately 10% on multicore machines with a large number of cores. Furthermore, extending this architecture successfully incorporates the computing capacity of a co-processor (GPU or Xeon Phi), achieving improvements of 25% to 30% in the total execution time of the model when both strategies (asynchrony and the use of a secondary compute device) are combined.
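
    The pipeline idea can be illustrated with a small conceptual Python sketch (not WRF code): the radiation update for the next step is launched on a worker thread standing in for the co-processor, while the main loop advances the remaining physics using the radiation produced one step earlier.

        from concurrent.futures import ThreadPoolExecutor

        def compute_radiation(state):
            # Placeholder for the expensive, high-precision radiation computation.
            return 0.99 * state

        def advance_other_physics(state, radiation):
            # Placeholder for dynamics, microphysics, and the rest of the model step.
            return state + 0.01 * radiation

        state, n_steps = 1.0, 10
        with ThreadPoolExecutor(max_workers=1) as offload:
            pending = offload.submit(compute_radiation, state)      # prime the pipeline
            for _ in range(n_steps):
                radiation = pending.result()                        # radiation from the previous step
                pending = offload.submit(compute_radiation, state)  # launch the next one asynchronously
                state = advance_other_physics(state, radiation)     # overlaps with the worker's computation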