    A hierarchical parallel implementation model for algebra-based CFD simulations on hybrid supercomputers

    Continuous improvement in hardware technology lets scientific computing advance steadily and pursue more ambitious goals. Since the start of the global race for exascale high-performance computing (HPC), massively parallel devices of various architectures have been incorporated into the newest supercomputers, leading to an increasing hybridization of HPC systems. In this context of accelerated innovation, software portability and efficiency become crucial. Traditionally, scientific computing software is built on calculations in iterative stencil loops (ISL) over a discretized geometry: the mesh. Although this approach is intuitive and versatile, the interdependency between algorithms and their computational implementations in stencil applications usually results in a large number of subroutines and makes portability and maintainability inevitably complex. An alternative is to break the interdependency between algorithm and implementation and cast the calculations into a minimal set of kernels. The portable implementation model that is the object of this thesis is not restricted to a particular numerical method or problem. However, owing to the CTTC's long tradition in computational fluid dynamics (CFD), and without loss of generality, this work targets transient CFD simulations. By casting discrete operators and mesh functions into (sparse) matrices and vectors, it is shown that all the calculations in a typical CFD algorithm reduce to three basic linear algebra subroutines: the sparse matrix-vector product, the linear combination of vectors, and the dot product. The proposed formulation eases the deployment of scientific computing software on massively parallel hybrid computing systems and is demonstrated in the large-scale direct numerical simulation of transient turbulent flows.
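
    The three kernels named in the abstract can be sketched in a few lines. This is only an illustration under stated assumptions, not the thesis's implementation: the function names (spmv, axpy, dot, laplacian_1d_csr), the CSR storage format, the unit grid spacing, and the explicit Euler diffusion step are all hypothetical choices made for this example.

```python
def spmv(row_ptr, col_idx, values, x):
    """Sparse matrix-vector product y = A x, with A stored in CSR format."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y


def axpy(alpha, x, y):
    """Linear combination of vectors: alpha * x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def dot(x, y):
    """Dot product of two vectors."""
    return sum(xi * yi for xi, yi in zip(x, y))


def laplacian_1d_csr(n):
    """CSR arrays of a 1D Laplacian (stencil 1, -2, 1) with zero boundary values.

    Hypothetical stand-in for a discrete operator cast into a sparse matrix.
    """
    row_ptr, col_idx, values = [0], [], []
    for i in range(n):
        for j, v in ((i - 1, 1.0), (i, -2.0), (i + 1, 1.0)):
            if 0 <= j < n:
                col_idx.append(j)
                values.append(v)
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx, values


n, dt = 5, 0.1
L = laplacian_1d_csr(n)
u = [0.0, 0.0, 1.0, 0.0, 0.0]          # mesh function stored as a vector
u_new = axpy(dt, spmv(*L, u), u)       # one explicit step: u + dt * (L u)
energy = dot(u_new, u_new)             # a reduction, e.g. for monitoring
```

    The time step above uses nothing but the three kernels, so porting it to another architecture would only require porting those three functions.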

    From Physics Model to Results: An Optimizing Framework for Cross-Architecture Code Generation

    Starting from a high-level problem description in terms of partial differential equations written in abstract tensor notation, the Chemora framework discretizes, optimizes, and generates complete high-performance codes for a wide range of compute architectures. Chemora extends the capabilities of Cactus, enabling efficient use of large-scale CPU/GPU systems for complex applications without low-level code tuning. Chemora achieves parallelism through MPI and multi-threading, combining OpenMP and CUDA. Optimizations include high-level code transformations, efficient loop traversal strategies, dynamically selected data and instruction cache usage strategies, and JIT compilation of GPU code tailored to the problem characteristics. The discretization is based on higher-order finite differences on multi-block domains. Chemora's capabilities are demonstrated by simulations of black hole collisions. This problem provides an acid test of the framework, as the Einstein equations contain hundreds of variables and thousands of terms.
    Comment: 18 pages, 4 figures, accepted for publication in Scientific Programming
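
    As a rough illustration of what "higher-order finite differences" means, the sketch below applies the standard fourth-order central stencil for a first derivative on a periodic 1D grid. The abstract does not state which stencils or orders Chemora generates, so the stencil choice, the function name ddx_4th_order, and the periodic single-block grid are assumptions made for this example only.

```python
import math


def ddx_4th_order(f, h):
    """Fourth-order central difference of periodic samples f with spacing h.

    Stencil: (f[i-2] - 8 f[i-1] + 8 f[i+1] - f[i+2]) / (12 h)
    """
    n = len(f)
    return [
        (f[(i - 2) % n] - 8.0 * f[(i - 1) % n]
         + 8.0 * f[(i + 1) % n] - f[(i + 2) % n]) / (12.0 * h)
        for i in range(n)
    ]


n = 64
h = 2.0 * math.pi / n
x = [i * h for i in range(n)]

# Differentiate sin(x) and compare against the exact derivative cos(x).
df = ddx_4th_order([math.sin(xi) for xi in x], h)
max_err = max(abs(d - math.cos(xi)) for d, xi in zip(df, x))
```

    The truncation error of this stencil scales as h^4, so halving the grid spacing should reduce max_err by roughly a factor of 16.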

    Acceleration of a Full-scale Industrial CFD Application with OP2


    An efficient multi-core implementation of a novel HSS-structured multifrontal solver using randomized sampling

    We present a sparse linear system solver that is based on a multifrontal variant of Gaussian elimination and exploits low-rank approximation of the resulting dense frontal matrices. We use hierarchically semiseparable (HSS) matrices, which have low-rank off-diagonal blocks, to approximate the frontal matrices. For HSS matrix construction, a randomized sampling algorithm is used together with interpolative decompositions. The combination of the randomized compression with a fast ULV HSS factorization leads to a solver with lower computational complexity than the standard multifrontal method for many applications, resulting in speedups of up to 7-fold for problems in our test suite. The implementation targets many-core systems by using task parallelism with dynamic runtime scheduling. Numerical experiments show performance improvements over state-of-the-art sparse direct solvers. The implementation achieves high performance and good scalability on a range of modern shared-memory parallel systems, including the Intel Xeon Phi (MIC). The code is part of a software package called STRUMPACK (STRUctured Matrices PACKage), which also has a distributed-memory component for dense rank-structured matrices.
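
    The idea behind randomized sampling of a low-rank block can be sketched as follows. This is not STRUMPACK's algorithm: the paper pairs sampling with interpolative decompositions, whereas this example swaps in a plain orthonormal projection (modified Gram-Schmidt) to keep it short. The helper names, matrix sizes, and the exactly rank-2 test block are hypothetical.

```python
import random


def matmul(A, B):
    """Dense product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def orthonormal_basis(Y, tol=1e-10):
    """Modified Gram-Schmidt on the columns of Y; dependent columns are dropped."""
    basis = []
    for col in transpose(Y):
        v = list(col)
        norm0 = sum(x * x for x in v) ** 0.5
        for q in basis:
            c = sum(qi * vi for qi, vi in zip(q, v))
            v = [vi - c * qi for qi, vi in zip(q, v)]
        norm = sum(x * x for x in v) ** 0.5
        if norm > tol * max(norm0, 1.0):
            basis.append([x / norm for x in v])
    return transpose(basis)  # columns are the basis vectors


rng = random.Random(0)
m, n, rank, samples = 8, 8, 2, 4

# Build a block that is exactly rank 2, standing in for an off-diagonal block.
U = [[rng.gauss(0, 1) for _ in range(rank)] for _ in range(m)]
V = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(rank)]
A = matmul(U, V)

# Sample the range of A with a few random vectors, then project A onto it.
Omega = [[rng.gauss(0, 1) for _ in range(samples)] for _ in range(n)]
Q = orthonormal_basis(matmul(A, Omega))
A_approx = matmul(Q, matmul(transpose(Q), A))

err = max(abs(a - b) for ra, rb in zip(A, A_approx) for a, b in zip(ra, rb))
```

    Only products of the block with random vectors are needed to find its range, which is why sampling is attractive when the block is expensive to form explicitly.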