2 research outputs found

    Heterogeneous parallel algorithms for computational fluid dynamics on unstructured meshes

    Frontiers of computational fluid dynamics (CFD) are constantly expanding and demand ever more computational resources. We are currently experiencing a rapid evolution of high-performance computing (HPC) systems, driven by power-consumption constraints. New HPC nodes incorporate accelerators used as math co-processors to increase throughput and the FLOP-per-watt ratio. Multi-core CPUs, in turn, have evolved into energy-efficient system-on-chip architectures in which the main components of the node are fused into a single chip, reducing energy costs. Several institutions and governments are investing in the research and development of different aspects of HPC that could lead to the next generations of supercomputers; these initiatives have named the problem the exascale challenge. This goal can only be achieved by incorporating major changes in computer architecture, memory design, and network interfaces. The CFD community therefore faces an important challenge: keeping pace with the rapid changes in HPC resources. Codes and formulations need to be redesigned in order to exploit the different levels of parallelism and the complex memory hierarchies of the new heterogeneous systems. The main characteristics demanded of the new CFD software are memory awareness, extreme concurrency, modularity, and portability.

    This thesis is devoted to the refactoring of a CFD algorithm for the adoption of new technologies. Our application context is the solution of incompressible flows (DNS or LES) on unstructured meshes. The first approach used GPUs to accelerate the Poisson solver, which is the most computationally intensive part of our application. The positive results obtained in this first step motivated us to port the complete time-integration phase of the application, which required a major redesign of the code. We propose a portable implementation model for CFD applications. The main idea is to substitute stencil data structures and kernels with algebraic storage formats and operators; by doing so, the algorithm is restructured into a minimal set of algebraic operations. The implementation strategy consists of a low-level algebraic layer for computations on CPUs and GPUs, and a high-level, user-friendly discretization layer for CPUs that is fully localized in the preprocessing stage, where performance does not play an important role. As a result, in the time-integration phase the code relies on only three algebraic kernels: the sparse matrix-vector product (SpMV), the linear combination of two vectors (AXPY), and the dot product (DOT). Such a small set of basic linear algebra operations naturally provides the desired portability to any computing architecture. Special attention was paid to the development of data structures compatible with the stream processing model.

    A detailed performance analysis was carried out for both sequential and parallel execution, engaging up to 128 GPUs in a hybrid CPU/GPU supercomputer. Moreover, we tested the portable implementation model of the TermoFluids code on the Mont-Blanc mobile-based supercomputer. The redesigned kernels exploit a heterogeneous execution model that uses both computing devices, CPU and GPU, of the ARM-based nodes. The load balancing between the two computing devices exploits a tabu search strategy that tunes the workload distribution during the preprocessing stage.
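    The claim above that the time integration reduces to SpMV, AXPY, and DOT can be made concrete with a small sketch. The following is an illustration, not the TermoFluids code: an unpreconditioned conjugate-gradient solve of the kind used for the pressure Poisson equation, written exclusively in terms of those three kernels, so that porting to a GPU amounts to swapping in device implementations of the three functions.

```python
# Minimal sketch: a whole iterative solve expressed with only the three
# algebraic kernels named in the abstract (SpMV, AXPY, DOT).
import numpy as np
import scipy.sparse as sp

def spmv(A, x):
    return A @ x                         # sparse matrix-vector product

def axpy(a, x, y):
    return a * x + y                     # linear combination a*x + y

def dot(x, y):
    return float(x @ y)                  # scalar product

def cg(A, b, tol=1e-8, maxit=1000):
    x = np.zeros_like(b)
    r = axpy(-1.0, spmv(A, x), b)        # r = b - A x
    p = r.copy()
    rr = dot(r, r)
    for _ in range(maxit):
        Ap = spmv(A, p)
        alpha = rr / dot(p, Ap)
        x = axpy(alpha, p, x)            # x += alpha p
        r = axpy(-alpha, Ap, r)          # r -= alpha A p
        rr_new = dot(r, r)
        if rr_new ** 0.5 < tol:
            break
        p = axpy(rr_new / rr, p, r)      # p = r + beta p
        rr = rr_new
    return x

# Tiny SPD test: 1D Poisson matrix with unit right-hand side.
n = 100
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)
x = cg(A, b)
print("residual norm:", np.linalg.norm(b - A @ x))
```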
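    The tabu-search load balancing can likewise be sketched compactly. In the hypothetical example below, measure_step_time is an invented stand-in for benchmarking one heterogeneous time step with a given fraction of the workload mapped to the GPU; the search keeps a short memory of recently visited splits to avoid cycling while it tunes the split during preprocessing.

```python
# Minimal sketch (assumed names and cost model, not the thesis code) of
# tuning the CPU/GPU workload split with tabu search at preprocessing time.
def measure_step_time(frac):
    # Hypothetical cost model, a stand-in for a real benchmark run: the GPU
    # is faster per cell but pays a fixed launch/transfer overhead; the
    # heterogeneous step time is the max over the two devices.
    gpu_time = 0.002 + frac * 0.25
    cpu_time = (1.0 - frac) * 1.0
    return max(gpu_time, cpu_time)

def tabu_search_split(steps=200, tabu_len=10, move=0.01):
    frac = 0.5                                  # initial GPU share
    best_frac, best_time = frac, measure_step_time(frac)
    tabu = []                                   # short memory of visited splits
    for _ in range(steps):
        neighbors = [round(min(1.0, max(0.0, frac + d)), 4)
                     for d in (-move, move)]
        candidates = [f for f in neighbors if f not in tabu]
        if not candidates:                      # aspiration: allow tabu moves
            candidates = neighbors
        frac = min(candidates, key=measure_step_time)
        tabu.append(frac)
        if len(tabu) > tabu_len:
            tabu.pop(0)
        t = measure_step_time(frac)
        if t < best_time:
            best_frac, best_time = frac, t
    return best_frac, best_time

if __name__ == "__main__":
    frac, t = tabu_search_split()
    print(f"best GPU share: {frac:.2f}, modeled step time: {t * 1e3:.2f} ms")
```

    Because the split is tuned once, before time integration starts, the (possibly expensive) benchmarking cost is paid only in the preprocessing stage.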
    A comparison of the Mont-Blanc prototypes with high-end supercomputers, in terms of achieved net performance and energy consumption, provided guidelines on the behavior of CFD applications on ARM-based architectures. Finally, we present a memory-aware, auto-tuned Poisson solver for problems with one Fourier-diagonalizable direction. This work was developed and tested on the BlueGene/Q Vesta supercomputer, and aims at demonstrating the relevance of vectorization and memory awareness for fully exploiting modern energy-efficient CPUs.
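    A minimal sketch of the Fourier-diagonalization idea (again an illustration, not the thesis solver): an FFT along the periodic direction turns a 2D Poisson problem into independent 1D systems, one per modified wavenumber, which can then be solved concurrently.

```python
# Sketch: Poisson solve exploiting one Fourier-diagonalizable (periodic)
# direction. Second-order finite differences; periodic in x, Dirichlet in y.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

nx, ny = 64, 50
dx, dy = 1.0 / nx, 1.0 / (ny + 1)

rng = np.random.default_rng(0)
f = rng.standard_normal((nx, ny))            # right-hand side

# 1) FFT along the periodic direction decouples the x-coupling.
f_hat = np.fft.fft(f, axis=0)

# 2) Modified wavenumbers: eigenvalues of the periodic second difference in x.
k = np.arange(nx)
lam = 2.0 * (np.cos(2.0 * np.pi * k / nx) - 1.0) / dx**2

# 3) One small tridiagonal system per Fourier mode, all independent.
D2y = sp.diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(ny, ny)) / dy**2
u_hat = np.empty_like(f_hat)
for m in range(nx):
    A_m = (D2y + lam[m] * sp.eye(ny)).astype(complex).tocsc()
    u_hat[m] = spla.spsolve(A_m, f_hat[m])

# 4) Inverse FFT recovers the solution in physical space.
u = np.real(np.fft.ifft(u_hat, axis=0))

# Residual of the full 2D discrete Laplacian (periodic x, Dirichlet y).
lap_x = (np.roll(u, -1, axis=0) - 2.0 * u + np.roll(u, 1, axis=0)) / dx**2
res = lap_x + (D2y @ u.T).T - f
print("max |residual|:", np.abs(res).max())
```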

    Parallel Implementation of a Sparse Approximate Inverse Preconditioner

    A parallel implementation of a sparse approximate inverse (SPAI) preconditioner for distributed memory parallel processors (DMPPs) is presented. The fundamental SPAI algorithm is known to be a useful tool for improving the convergence of iterative solvers for ill-conditioned linear systems. The parallel implementation (ParSPAI) exploits the inherent parallelism in the SPAI algorithm and the data locality on the DMPPs to solve structurally symmetric (but non-symmetric) matrices, which typically arise when solving partial differential equations (PDEs). Some initial performance results are presented which suggest the usefulness of ParSPAI for tackling such large systems on present-day DMPPs in reasonable time. The ParSPAI preconditioner is implemented using the Message Passing Interface (MPI) and is embedded in the parallel library for unstructured mesh problems (PLUMP).

    1 Introduction. We consider the linear system of equations

        Ax = b,   x, b ∈ ℝⁿ,   (1)

    Here A is a large and..
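    As a sketch of the Frobenius-norm minimization behind SPAI (a serial illustration assuming the sparsity pattern of M is taken to be that of A; the paper's ParSPAI distributes the work over MPI ranks): each column m_j of the approximate inverse solves an independent small least-squares problem min ||A m_j - e_j||_2, which is what makes the method naturally parallel.

```python
# Minimal serial sketch of the column-wise SPAI construction.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def spai(A, pattern=None):
    """Right approximate inverse M minimizing ||A M - I||_F column by column."""
    A = A.tocsc()
    n = A.shape[0]
    P = (pattern if pattern is not None else A).tocsc()
    rows, cols, vals = [], [], []
    for j in range(n):
        J = P.indices[P.indptr[j]:P.indptr[j + 1]]   # allowed nonzeros of m_j
        I = np.unique(A[:, J].tocoo().row)           # rows reached by columns J
        Ahat = A[I, :][:, J].toarray()               # small dense subproblem
        e = (I == j).astype(float)                   # e_j restricted to rows I
        m, *_ = np.linalg.lstsq(Ahat, e, rcond=None)
        rows.extend(J)
        cols.extend([j] * len(J))
        vals.extend(m)
    return sp.csc_matrix((vals, (rows, cols)), shape=(n, n))

# Tiny demo on a diagonally dominant tridiagonal matrix.
n = 200
A = sp.diags([-1.0, 4.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
M = spai(A)
print("||A M - I||_F =", spla.norm(A @ M - sp.eye(n)))
```

    In a distributed setting, each process would own a block of columns and gather only the rows of A its patterns touch, which is the data-locality property the abstract highlights.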