26 research outputs found

Dense and sparse parallel linear algebra algorithms on graphics processing units

Author: Lamas Daviña Alejandro
Publication venue: Universitat Politecnica de Valencia
Publication date: 13/11/2018

One line of development in supercomputing is the use of special-purpose processors to speed up certain types of computation. In this thesis we study the use of graphics processing units (GPUs) as computational accelerators and apply them to linear algebra. In particular, we work with the SLEPc library to solve large-scale eigenvalue problems and to apply matrix functions in scientific applications. SLEPc is a parallel library based on the MPI standard and is developed with scalability as a premise, i.e. it should allow larger problems to be solved as the number of processing units increases. We address the linear eigenvalue problem, Ax = λx in its standard form, with iterative techniques, specifically Krylov methods, with which we compute a small portion of the eigenvalue spectrum. This type of algorithm generates a subspace of reduced size (m) onto which the large problem (of dimension n) is projected, with m << n. Once the problem has been projected, it is solved by direct methods, which provide approximations to the eigenvalues of the original problem. The operations used to expand the subspace depend on whether the desired eigenvalues lie in the exterior or the interior of the spectrum. For exterior eigenvalues, the expansion is done by matrix-vector multiplications. We perform this operation on the GPU, either by using libraries or by writing functions that take advantage of the structure of the matrix. For interior eigenvalues, the expansion requires solving linear systems of equations. In this thesis we implement several GPU algorithms for solving linear systems in the specific case of matrices with a block-tridiagonal structure.

In the computation of matrix functions we have to distinguish between the direct application of a function to a matrix, f(A), and the action of a matrix function on a vector, f(A)b. The first case involves a dense computation that limits the size of the problem. The second allows us to work with large sparse matrices, and to solve it we also use Krylov methods. The subspace is expanded by matrix-vector multiplications, and we use GPUs in the same way as for eigenvalue problems. In this case the projected problem starts with size m, but it grows by m at each restart of the method. The projected problem is solved by applying a matrix function directly. We have implemented several algorithms to compute the matrix square root and the matrix exponential, in which the use of GPUs speeds up the computation.

Lamas Daviña, A. (2018). Dense and sparse parallel linear algebra algorithms on graphics processing units [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/112425
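The projection idea in this abstract (build a small Krylov subspace of size m, then solve the projected problem with a direct method) can be illustrated with a minimal sketch. This is not the SLEPc implementation and uses no GPU. It is a plain-Python Lanczos iteration for a symmetric matrix, and the helper names (`matvec`, `lanczos`, `count_below`, `largest_ritz_value`) and the test matrix are assumptions made for illustration only.

```python
import math


def matvec(A, x):
    """Dense matrix-vector product (the step the abstract runs on the GPU)."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def lanczos(A, v0, m):
    """Project symmetric A onto an m-dimensional Krylov subspace.

    Returns the diagonal (alpha) and off-diagonal (beta) of the small
    tridiagonal matrix T that represents A in that subspace.
    """
    norm = math.sqrt(dot(v0, v0))
    basis = [[x / norm for x in v0]]
    alpha, beta = [], []
    for j in range(m):
        w = matvec(A, basis[j])
        alpha.append(dot(w, basis[j]))
        # Full reorthogonalization against every basis vector so far.
        for v in basis:
            c = dot(w, v)
            w = [wi - c * vi for wi, vi in zip(w, v)]
        b = math.sqrt(dot(w, w))
        if j == m - 1 or b < 1e-12:
            break
        beta.append(b)
        basis.append([wi / b for wi in w])
    return alpha, beta


def count_below(alpha, beta, x):
    """Number of eigenvalues of T smaller than x (Sturm sequence count)."""
    count, d = 0, 1.0
    for i, a in enumerate(alpha):
        off = beta[i - 1] ** 2 if i > 0 else 0.0
        if d == 0.0:
            d = 1e-300  # avoid division by zero
        d = a - x - off / d
        if d < 0.0:
            count += 1
    return count


def largest_ritz_value(alpha, beta, iters=200):
    """Largest eigenvalue of T by bisection inside the Gershgorin bounds."""
    k = len(alpha)
    radius = [(beta[i - 1] if i > 0 else 0.0) + (beta[i] if i < k - 1 else 0.0)
              for i in range(k)]
    lo = min(a - r for a, r in zip(alpha, radius))
    hi = max(a + r for a, r in zip(alpha, radius))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if count_below(alpha, beta, mid) == k:
            hi = mid  # every eigenvalue of T lies below mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


# Hypothetical test problem: n = 20, one well-separated exterior eigenvalue.
n = 20
diag = [float(i) for i in range(1, n)] + [100.0]
A = [[diag[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
alpha, beta = lanczos(A, [1.0] * n, m=8)
ritz = largest_ritz_value(alpha, beta)
```

With m = 8 and n = 20 the largest Ritz value should already be very close to the exterior eigenvalue 100, which is the m << n behaviour the abstract relies on. Interior eigenvalues would need the linear-system solves mentioned above, which this sketch does not cover.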

RiuNet

Large-scale applications and scalability for problems in the mechanics of soft biological tissues in arterial wall structures

Author: Fischle Andreas
Publication date: 10/03/2015

Access to the full text is blocked; a new version is available under DuEPublico-ID 37999. An MPI-parallel Newton-Krylov-FETI-DP solver based on FEAP is presented together with applications to nonlinear problems in the quasi-static biomechanics of soft biological tissues. The formulation is based on highly nonlinear hyperelastic anisotropic and poly-convex models. High-resolution computations of the wall stresses in patient-specific arterial wall structures subjected to an interior normal pressure in the physiological regime of the blood pressure (up to 500 [mmHg]) are reported together with results on strong scalability. The weak scalability of Newton-Krylov-FETI-DP is investigated for up to 140 million degrees of freedom using 4096 processor cores on a Cray XT6m supercomputer in a series of simple tension tests. An implementation of a new FEAP interface called libfw is presented, which allows for the flexible, unified integration of FEAP into other software packages, e.g., into LifeV. The modifications made to FEAP are dissected and discussed in detail as a case study in order to illustrate possible approaches for the integration of different code components or applications in similar scenarios.

Duisburg-Essen Publications Online

A parallel Newton-Krylov-FETI-DP Solver based on FEAP: Large-scale applications and scalability for problems in the mechanics of soft biological tissues in arterial wall structures

Author: Fischle Andreas
Publication date: 20/04/2015

An MPI-parallel Newton-Krylov-FETI-DP solver based on FEAP is presented together with applications to nonlinear problems in the quasi-static biomechanics of soft biological tissues. The formulation is based on highly nonlinear hyperelastic anisotropic and poly-convex models. High-resolution computations of the wall stresses in patient-specific arterial wall structures subjected to an interior normal pressure in the physiological regime of the blood pressure (up to 500 [mmHg]) are reported together with results on strong scalability. The weak scalability of Newton-Krylov-FETI-DP is investigated for up to 140 million degrees of freedom using 4096 processor cores on a Cray XT6m supercomputer in a series of simple tension tests. An implementation of a new FEAP interface called libfw is presented, which allows for the flexible, unified integration of FEAP into other software packages, e.g., into LifeV. The modifications made to FEAP are dissected and discussed in detail as a case study in order to illustrate possible approaches for the integration of different code components or applications in similar scenarios.

Duisburg-Essen Publications Online

Productive and efficient computational science through domain-specific abstractions

Author: Rathgeber Florian
Publication venue: Computing, Imperial College London
Publication date: 01/11/2014

In an ideal world, scientific applications are computationally efficient, maintainable and composable and allow scientists to work very productively. We argue that these goals are achievable for a specific application field by choosing suitable domain-specific abstractions that encapsulate domain knowledge with a high degree of expressiveness. This thesis demonstrates the design and composition of domain-specific abstractions by abstracting the stages a scientist goes through in formulating a problem of numerically solving a partial differential equation. Domain knowledge is used to transform this problem into a different, lower level representation and decompose it into parts which can be solved using existing tools. A system for the portable solution of partial differential equations using the finite element method on unstructured meshes is formulated, in which contributions from different scientific communities are composed to solve sophisticated problems. The concrete implementations of these domain-specific abstractions are Firedrake and PyOP2. Firedrake allows scientists to describe variational forms and discretisations for linear and non-linear finite element problems symbolically, in a notation very close to their mathematical models. PyOP2 abstracts the performance-portable parallel execution of local computations over the mesh on a range of hardware architectures, targeting multi-core CPUs, GPUs and accelerators. Thereby, a separation of concerns is achieved, in which Firedrake encapsulates domain knowledge about the finite element method separately from its efficient parallel execution in PyOP2, which in turn is completely agnostic to the higher abstraction layer. As a consequence of the composability of those abstractions, optimised implementations for different hardware architectures can be automatically generated without any changes to a single high-level source. Performance matches or exceeds what is realistically attainable by hand-written code. 
Firedrake and PyOP2 are combined to form a tool chain that is demonstrated to be competitive with or faster than available alternatives on a wide range of different finite element problems.

Spiral - Imperial College Digital Repository

A hybrid parallel framework for computational solid mechanics

Author: Fidkowski Piotr
Publication venue: Massachusetts Institute of Technology
Publication date: 01/01/2011

Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Aeronautics and Astronautics, 2011. Cataloged from PDF version of thesis. Includes bibliographical references (p. 95-98). By Piotr Fidkowski.

A novel, hybrid parallel C++ framework for computational solid mechanics is developed and presented. The modular and extensible design of this framework allows it to support a wide variety of numerical schemes including discontinuous Galerkin formulations and higher order methods, multiphysics problems, hybrid meshes made of different types of elements and a number of different linear and non-linear solvers. In addition, native, seamless support is included for hardware acceleration by Graphics Processing Units (GPUs) via NVIDIA's CUDA architecture for both single GPU workstations and heterogeneous clusters of GPUs. The capabilities of the framework are demonstrated through a series of sample problems, including a laser induced cylindrical shock propagation, a dynamic problem involving a micro-truss array made of millions of elements, and a tension problem involving a shape memory alloy with a multifield formulation to model the superelastic effect.

DSpace@MIT

Compiler Support for Operator Overloading and Algorithmic Differentiation in C++

Author: Hück Alexander
Publication date: 01/01/2020

Multiphysics software needs derivatives for, e.g., solving a system of non-linear equations, conducting model verification, or sensitivity studies. In C++, algorithmic differentiation (AD), based on operator overloading (overloading), can be used to calculate derivatives up to machine precision. To that end, the built-in floating-point type is replaced by the user-defined AD type. It overloads all required operators, and calculates the original value and the corresponding derivative based on the chain rule of calculus. While changing the underlying type seems straightforward, several complications arise concerning software and performance engineering. These include (1) fundamental language restrictions of C++ w.r.t. user-defined types, (2) type correctness of distributed computations with the Message Passing Interface (MPI) library, and (3) identification and mitigation of AD-induced overheads. To handle these issues, AD experts may spend a significant amount of time to enhance a code with AD, verify the derivatives and ensure optimal application performance. Hence, in this thesis, we propose a modern compiler-based tooling approach to support and accelerate the AD-enhancement process of C++ target codes. In particular, we make contributions to three aspects of AD.

The initial type change - While the change to the AD type in a target code is conceptually straightforward, the type change often leads to a multitude of compiler error messages. This is due to the different treatment of built-in floating-point types and user-defined types by the C++ language standard. Previously legal code constructs in the target code subsequently violate the language standard when the built-in floating-point type is replaced with a user-defined AD type. We identify and classify these problematic code constructs, show their root causes, and propose solutions based on localized source transformation. To automate this rather mechanical process, we develop a static code analyser and source transformation tool, called OO-Lint, based on the Clang compiler framework. It flags instances of these problematic code constructs and applies source transformations to make the code compliant with the requirements of the language standard. To show the overall relevance of complications with user-defined types, OO-Lint is applied to several well-known scientific codes, some of which have already been AD enhanced by others. In all of these applications, except the ones manually treated for AD overloading, problematic code constructs are detected.

Type correctness of MPI communication - MPI is the de-facto standard for programming high performance, distributed applications. At the same time, MPI has a complex interface whose usage can be error-prone. For instance, MPI derived data types require manual construction by specifying memory locations of the underlying data. Specifying wrong offsets can lead to subtle bugs that are hard to detect. In the context of AD, special libraries exist that handle the required derivative book-keeping by replacing the MPI communication calls with overloaded variants. However, on top of the AD type change, the MPI communication routines have to be changed manually. In addition, the AD type fundamentally changes memory layout assumptions as it has a different extent than the built-in types. Previously legal layout assumptions thus have to be reverified. As a remedy, to detect any type-related errors, we developed a memory sanitizer tool, called TypeART, based on the LLVM compiler framework and the MPI correctness checker MUST. It tracks all memory allocations relevant to MPI communication to allow for checking the underlying type and extent of the typeless memory buffer address passed to any MPI routine. The overhead induced by TypeART w.r.t. several target applications is manageable.

AD domain-specific profiling - Applying AD in a black-box manner, without consideration of the target code structure, can have a significant impact on both runtime and memory consumption. An AD expert is usually required to apply further AD-related optimizations for the reduction of these induced overheads. Traditional profiling techniques are, however, insufficient as they do not reveal any AD domain-specific metrics. Of interest for AD code optimization are, e.g., specific code patterns, especially on a function level, that can be treated efficiently with AD. To that end, we developed a static profiling tool, called ProAD, based on the LLVM compiler framework. For each function, it generates the computational graph based on the static data flow of the floating-point variables. The framework supports pattern analysis on the computational graph to identify the optimal application of the chain rule. We show the potential of the optimal application of AD with two case studies. In both cases, significant runtime improvements can be achieved when the knowledge of the code structure, provided by our tool, is exploited. For instance, with a stencil code, a speedup factor of about 13 is achieved compared to a naive application of AD and a factor of 1.2 compared to hand-written derivative code.
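The operator-overloading approach to AD described in this abstract (replace the floating-point type with a user-defined type that carries a value and its derivative, then overload the arithmetic operators) can be sketched in a few lines. The thesis targets C++; the sketch below uses Python only to keep the illustration short, covers forward mode only, and does not represent OO-Lint, TypeART, or ProAD. The names `Dual`, `lift`, `sin`, and `derivative` are hypothetical.

```python
import math


class Dual:
    """A value paired with its derivative (forward-mode AD type)."""

    def __init__(self, value, deriv=0.0):
        self.value = float(value)
        self.deriv = float(deriv)

    @staticmethod
    def lift(x):
        """Promote a plain number to a Dual with zero derivative."""
        return x if isinstance(x, Dual) else Dual(x)

    def __add__(self, other):
        o = Dual.lift(other)
        return Dual(self.value + o.value, self.deriv + o.deriv)

    __radd__ = __add__

    def __sub__(self, other):
        o = Dual.lift(other)
        return Dual(self.value - o.value, self.deriv - o.deriv)

    def __rsub__(self, other):
        return Dual.lift(other) - self

    def __mul__(self, other):
        o = Dual.lift(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * o.value,
                    self.deriv * o.value + self.value * o.deriv)

    __rmul__ = __mul__


def sin(x):
    """Overloaded sine applying the chain rule to the derivative part."""
    x = Dual.lift(x)
    return Dual(math.sin(x.value), math.cos(x.value) * x.deriv)


def derivative(f, x):
    """Seed the input derivative with 1 and read off df/dx."""
    return f(Dual(x, 1.0)).deriv


def f(x):
    # Unchanged "target code": it only sees arithmetic operators.
    return x * x * x + 2 * x


slope = derivative(f, 3.0)  # 3 * 3**2 + 2 = 29
```

The function `f` is written once and works for both plain floats and `Dual` values, which is the appeal of the type-change approach. In Python this substitution is nearly free; the language restrictions, MPI layout issues, and overheads listed above are specific to making the same change in C++.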

TUbiblio

tuprints

Geometry–aware finite element framework for multi–physics simulations: an algorithmic and software-centric perspective

Author: Krause Rolf, Zulian Patrick
Publication date: 03/08/2017

In finite element simulations, the handling of geometrical objects and their discrete representation is a critical aspect in both serial and parallel scientific software environments. Developing codes that target such environments requires considerable effort and many man-hours. In this thesis we approach these issues from three fronts. First, stable and efficient techniques for the transfer of discrete fields between non-matching volume or surface meshes are an essential ingredient for the discretization and numerical solution of coupled multi-physics and multi-scale problems. In particular, L2-projections allow for the transfer of discrete fields between unstructured meshes, both in the volume and on the surface. We present an algorithm for parallelizing the assembly of the L2-transfer operator for unstructured meshes which are arbitrarily distributed among different processes. The algorithm requires no a priori information on the geometrical relationship between the different meshes. Second, the geometric representation is often a limiting factor which imposes a trade-off between how accurately the shape is described and what methods can be employed for solving a system of differential equations. Parametric finite elements and bijective mappings between polygons or polyhedra allow us to flexibly construct finite element discretizations with arbitrary resolutions without sacrificing the accuracy of the shape description. Such flexibility allows employing state-of-the-art techniques, such as geometric multigrid methods, on meshes with almost any shape. Last, the way numerical techniques are represented in software libraries and approached from a development perspective affects both the usability and the maintainability of such libraries. Completely separating the intent of high-level routines from the actual implementation and technologies allows for portable and maintainable performance. We provide an overview of current trends in the development of scientific software and showcase our open-source library utopia.

RERO DOC Digital Library

Task-based multifrontal QR solver for heterogeneous architectures

Author: Lopez Florent
Publication date: 11/12/2015

To face the advent of multicore processors and the ever increasing complexity of hardware architectures, programming models based on DAG parallelism have regained popularity in the high performance, scientific computing community. Modern runtime systems offer a programming interface that complies with this paradigm and powerful engines for scheduling the tasks into which the application is decomposed. These tools have already proved their effectiveness on a number of dense linear algebra applications. In this study we investigate the design of task-based sparse direct solvers on top of runtime systems. Such solvers constitute extremely irregular workloads, with tasks of different granularities and characteristics and with variable memory consumption. In the context of the qr_mumps solver, we demonstrate the usability and effectiveness of our approach with the implementation of a sparse matrix multifrontal factorization based on a Sequential Task Flow parallel programming model. Using this programming model, we developed features such as the integration of dense 2D Communication Avoiding algorithms in the multifrontal method, allowing for better scalability compared to the original approach used in qr_mumps. In addition, we introduced a memory-aware algorithm to control the memory behaviour of our solver and show, in the context of multicore architectures, an important reduction of the memory footprint for the multifrontal QR factorization with a small impact on performance. Following this approach, we move to heterogeneous architectures where task granularity and scheduling strategies are critical to achieve performance. We present, for the multifrontal method, a hierarchical strategy for data partitioning and a scheduling algorithm capable of handling the heterogeneity of resources. Finally, we present a study on the reproducibility of executions and the use of alternative programming models for the implementation of the multifrontal method. All the experimental results presented in this study are evaluated with a detailed performance analysis measuring the impact of several identified effects on the performance and scalability. Thanks to this original analysis, presented in the first part of this study, we are capable of fully understanding the results obtained with our solver.

Thèses en ligne de l'Université Toulouse III - Paul Sabatier

Parallel Overlapping Schwarz Preconditioners for Incompressible Fluid Flow and Fluid-Structure Interaction Problems

Author: Hochmuth Christian
Publication date: 23/06/2020

Efficient methods for the approximation of solutions to incompressible fluid flow and fluid-structure interaction problems are presented. In particular, partial differential equations (PDEs) are derived from basic conservation principles. First, the incompressible Navier-Stokes equations for Newtonian fluids are introduced. This is followed by a consideration of solid mechanical problems. Both the fluid equations and the equation for solid problems are then coupled and a fluid-structure interaction problem is constructed. Furthermore, a discretization by the finite element method for weak formulations of these problems is described. This spatial discretization of variables is followed by a discretization of the remaining time-dependent parts. An implementation of the discretizations and problems in a parallel C++ software environment is described. This implementation is based on the software package Trilinos. The parallel execution of a program is the essence of High Performance Computing (HPC). HPC clusters are, in general, machines with several tens of thousands of cores. The fastest current machine, as of the TOP500 list from November 2019, has over 2.4 million cores, while the largest machine possesses over 10 million cores. To achieve sufficient accuracy of the approximate solutions, a fine spatial discretization must be used. In particular, fine spatial discretizations lead to systems with large sparse matrices that have to be solved. Iterative preconditioned Krylov methods are among the most widely used and efficient solution strategies for these systems. Robust and efficient preconditioners which possess good scaling behavior for a parallel execution on several thousand cores are the main component. In this thesis, the focus is on parallel algebraic preconditioners for fluid and fluid-structure interaction problems. Therefore, monolithic overlapping Schwarz preconditioners for saddle point problems of Stokes and Navier-Stokes problems are presented. Monolithic preconditioners for incompressible fluid flow problems can significantly improve the convergence speed compared to preconditioners based on block factorizations. In order to obtain numerically scalable algorithms, coarse spaces obtained from the Generalized Dryja-Smith-Widlund (GDSW) and the Reduced dimension GDSW (RGDSW) approach are used. These coarse spaces can be constructed in an essentially algebraic way. Numerical results of the parallel implementation are presented for various incompressible fluid flow problems. Good scalability was achieved for up to 11 979 MPI ranks, which corresponds to the largest problem configuration fitting on the employed supercomputer. A comparison of these monolithic approaches and commonly used block preconditioners with respect to time-to-solution is made. Similarly, the most efficient construction of two-level overlapping Schwarz preconditioners with GDSW and RGDSW coarse spaces for solid problems is reported. These techniques are then combined to efficiently solve fully coupled monolithic fluid-structure interaction problems.

Kölner UniversitätsPublikationsServer