85 research outputs found

    Parallel Smoothers for Matrix-based Multigrid Methods on Unstructured Meshes Using Multicore CPUs and GPUs

    Get PDF
    Multigrid methods are efficient and fast solvers for problems typically modeled by partial differential equations of elliptic type. For problems with complex geometries and local singularities, stencil-type discrete operators on equidistant Cartesian grids need to be replaced by more flexible concepts for unstructured meshes, in order to properly resolve all problem-inherent specifics while maintaining a moderate number of unknowns. However, mesh flexibility comes with severe drawbacks for parallel execution, especially with respect to the definition of adequate smoothers. This point becomes particularly pronounced in the framework of fine-grained parallelism on GPUs with hundreds of execution units. We use the approach of matrix-based multigrid, which has high flexibility and adapts well to the exigencies of modern computing platforms. In this work we investigate multi-colored Gauß-Seidel type smoothers and the power(q)-pattern enhanced multi-colored ILU(p) smoothers with fill-ins
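
    As a concrete illustration of the smoother family named above, the sketch below performs one multi-coloured Gauss-Seidel sweep on a CSR matrix in plain C++. The function name and the CSR/colouring data layout are assumptions for illustration, not taken from the paper. Rows of equal colour share no off-diagonal coupling, so each colour class can be updated concurrently, which is exactly the property that maps onto the fine-grained parallelism of GPUs.

        // Minimal sketch of one multi-coloured Gauss-Seidel sweep on a CSR
        // matrix; all names (row_ptr, col_idx, colour, ...) are illustrative.
        #include <cstddef>
        #include <vector>

        // One smoothing sweep for A x = b. 'colour[i]' assigns row i to a
        // class such that rows of equal colour share no off-diagonal coupling:
        // rows inside one class are independent and can be updated in parallel
        // (e.g. one GPU thread per row); the classes themselves run in order.
        // Assumes every row stores its diagonal entry.
        void multicolour_gauss_seidel(
            const std::vector<std::size_t>& row_ptr,  // CSR row pointers
            const std::vector<std::size_t>& col_idx,  // CSR column indices
            const std::vector<double>& val,           // CSR values
            const std::vector<double>& b,
            const std::vector<int>& colour,           // colour class per row
            int num_colours,
            std::vector<double>& x)
        {
            for (int c = 0; c < num_colours; ++c) {
                // this inner loop is the parallel one on a GPU
                for (std::size_t i = 0; i < b.size(); ++i) {
                    if (colour[i] != c) continue;
                    double diag = 0.0, sum = b[i];
                    for (std::size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
                        if (col_idx[k] == i) diag = val[k];
                        else                 sum -= val[k] * x[col_idx[k]];
                    }
                    x[i] = sum / diag;
                }
            }
        }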

    Asynchronous and Multiprecision Linear Solvers - Scalable and Fault-Tolerant Numerics for Energy Efficient High Performance Computing

    Get PDF
    Asynchronous methods minimize idle times by removing synchronization barriers and therefore allow efficient usage of computer systems. The implied high tolerance of communication latencies also improves fault tolerance. Since asynchronous methods additionally enable the use of the power- and energy-saving mechanisms provided by the hardware, they are suitable candidates for the highly parallel and heterogeneous hardware platforms expected in the near future
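
    A minimal sketch of the idea, under the assumption of a dense, strongly diagonally dominant system: each thread repeatedly updates its own block of unknowns with no barrier between sweeps, reading whatever values of the other unknowns are currently visible. All names are illustrative; relaxed atomics make the stale reads well-defined in C++.

        // Minimal sketch of an asynchronous (barrier-free) Jacobi iteration
        // for a dense, strongly diagonally dominant system; async_jacobi,
        // sweeps and num_threads are illustrative names, not a library API.
        #include <algorithm>
        #include <atomic>
        #include <cstddef>
        #include <thread>
        #include <vector>

        void async_jacobi(const std::vector<std::vector<double>>& A,
                          const std::vector<double>& b,
                          std::vector<std::atomic<double>>& x,  // shared iterate
                          int sweeps, int num_threads)
        {
            const std::size_t n = b.size();
            // Each worker sweeps over its own block of unknowns; there is no
            // barrier between sweeps, so fast threads never wait for slow
            // ones and may read slightly stale values of other unknowns.
            auto worker = [&](std::size_t lo, std::size_t hi) {
                for (int s = 0; s < sweeps; ++s)
                    for (std::size_t i = lo; i < hi; ++i) {
                        double sum = b[i];
                        for (std::size_t j = 0; j < n; ++j)
                            if (j != i)  // stale reads are tolerated by design
                                sum -= A[i][j] * x[j].load(std::memory_order_relaxed);
                        x[i].store(sum / A[i][i], std::memory_order_relaxed);
                    }
            };
            std::vector<std::thread> pool;
            const std::size_t chunk = (n + num_threads - 1) / num_threads;
            for (int t = 0; t < num_threads; ++t) {
                const std::size_t lo = t * chunk;
                const std::size_t hi = std::min(n, lo + chunk);
                if (lo < hi) pool.emplace_back(worker, lo, hi);
            }
            for (auto& th : pool) th.join();  // the only synchronisation point
        }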

    Status and Future Perspectives for Lattice Gauge Theory Calculations to the Exascale and Beyond

    Full text link
    In this and a set of companion whitepapers, the USQCD Collaboration lays out a program of science and computing for lattice gauge theory. These whitepapers describe how calculations using lattice QCD (and other gauge theories) can aid the interpretation of ongoing and upcoming experiments in particle and nuclear physics, as well as inspire new ones. (44 pages)

    A Full-Depth Amalgamated Parallel 3D Geometric Multigrid Solver for GPU Clusters

    Get PDF
    Numerical computations of incompressible flow equations with pressure-based algorithms necessitate the solution of an elliptic Poisson equation, for which multigrid methods are known to be very efficient. In our previous work we presented a dual-level (MPI-CUDA) parallel implementation of the Navier-Stokes equations to simulate buoyancy-driven incompressible fluid flows on GPU clusters with simple iterative methods, focusing on the scalability of the overall solver. In the present study we describe the implementation and performance of a multigrid method to solve the pressure Poisson equation within our MPI-CUDA parallel incompressible flow solver. Various design decisions and algorithmic choices for multigrid methods are explored in light of NVIDIA's recent Fermi architecture. We discuss how unique aspects of an MPI-CUDA implementation for GPU clusters are related to the software choices made to implement the multigrid method. We propose a new coarse-grid solution method, embedded multigrid with amalgamation, and show that the parallel implementation retains the numerical efficiency of the multigrid method. Performance measurements on the NCSA Lincoln and TACC Longhorn clusters are presented for up to 64 GPUs
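
    For orientation, the sketch below shows the bare structure of a recursive multigrid V-cycle for the 1D Poisson problem (damped Jacobi smoothing, full-weighting restriction, linear prolongation). It is purely illustrative and does not reflect the paper's MPI-CUDA implementation or its amalgamated coarse-grid scheme.

        // Minimal sketch of a recursive V-cycle for -u'' = f on a uniform 1D
        // grid of 2^k + 1 points with homogeneous Dirichlet boundaries.
        #include <cstddef>
        #include <vector>

        using Vec = std::vector<double>;

        // Damped Jacobi smoother for the stencil (-1, 2, -1) / h^2.
        static void smooth(Vec& u, const Vec& f, double h, int sweeps)
        {
            const double w = 2.0 / 3.0;  // classic damping factor
            Vec t(u);
            for (int s = 0; s < sweeps; ++s) {
                for (std::size_t i = 1; i + 1 < u.size(); ++i) {
                    const double Au = (2.0 * u[i] - u[i - 1] - u[i + 1]) / (h * h);
                    t[i] = u[i] + w * (h * h / 2.0) * (f[i] - Au);
                }
                u.swap(t);
            }
        }

        static void v_cycle(Vec& u, const Vec& f, double h)
        {
            const std::size_t n = u.size();
            if (n <= 3) {                               // coarsest grid
                if (n == 3) u[1] = f[1] * h * h / 2.0;  // single unknown
                return;
            }
            smooth(u, f, h, 2);                         // pre-smoothing
            auto res = [&](std::size_t j) {             // residual f - A u
                return f[j] - (2.0 * u[j] - u[j - 1] - u[j + 1]) / (h * h);
            };
            const std::size_t nc = (n - 1) / 2 + 1;
            Vec rc(nc, 0.0), ec(nc, 0.0);
            for (std::size_t I = 1; I + 1 < nc; ++I)    // full weighting
                rc[I] = 0.25 * res(2*I - 1) + 0.5 * res(2*I) + 0.25 * res(2*I + 1);
            v_cycle(ec, rc, 2.0 * h);                   // coarse correction
            for (std::size_t I = 0; I + 1 < nc; ++I) {  // linear prolongation
                u[2 * I]     += ec[I];
                u[2 * I + 1] += 0.5 * (ec[I] + ec[I + 1]);
            }
            smooth(u, f, h, 2);                         // post-smoothing
        }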

    Fast and accurate finite-element multigrid solvers for PDE simulations on GPU clusters

    Get PDF
    The main contribution of this thesis is to demonstrate that graphics processors (GPUs), as representatives of emerging many-core architectures, are very well suited for the fast and accurate solution of large sparse linear systems of equations, using parallel multigrid methods on heterogeneous compute clusters. Such systems arise, for instance, in the discretisation of (elliptic) partial differential equations with finite elements. We report on at least one order of magnitude speedup over highly-tuned conventional CPU implementations, without sacrificing either accuracy or functionality. In more detail, this thesis includes the following contributions: Single precision floating point computations may be insufficient for the class of problems considered in this thesis. We revisit mixed precision iterative refinement techniques, not only to increase the accuracy of computed results but also to increase the efficiency of the solution process as a whole. Both on CPUs and on GPUs, we demonstrate a significant performance improvement without loss of accuracy compared to computing in high precision only. We present efficient parallelisation techniques for multigrid solvers on graphics hardware, in particular for numerically strong smoothers and preconditioners that are suitable for highly anisotropic grids and operators. For instance, an efficient formulation of the cyclic reduction algorithm for solving tridiagonal systems is developed. In view of hardware-oriented numerics, we carefully analyse the trade-off between numerical and runtime performance for inexact parallelisation techniques that decouple some of the inherently sequential characteristics of strong smoothing operators in favour of better parallelisation properties. For large-scale established software frameworks, re-implementation tailored to novel hardware platforms is often prohibitively expensive. We develop a 'minimally invasive' approach to integrate support for co-processor hardware like GPUs into FEAST, a finite element discretisation and solver toolbox. Our technique has the major advantage that applications built on top of the toolbox do not have to be changed at all to benefit from co-processor acceleration. The approach is evaluated for benchmark problems in linearised elasticity and stationary laminar flow computed on large-scale GPU-enhanced clusters. Good speedup factors and near-ideal weak scalability are observed; the achievable speedup is analysed and a theoretical speedup model is presented. Finally, we provide a historical overview of scientific computing on graphics hardware since the early beginnings in 2001/2002, when GPGPU was an obscure research topic pursued by few, to the widespread adoption of today. We discuss the evolution of both the hardware and the programming model, and provide a comprehensive bibliography of publications related to PDE simulations on GPUs
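
    The mixed precision iterative refinement loop at the heart of the first contribution has a compact generic form, sketched below under simplifying assumptions (dense storage, a plain Jacobi inner solver standing in for the GPU multigrid cycle); none of the names correspond to FEAST's actual API.

        // Minimal sketch of mixed precision iterative refinement: the cheap
        // approximate solve runs in single precision, while the residual and
        // the solution update are accumulated in double precision. The Jacobi
        // inner solver below merely stands in for the GPU multigrid cycle of
        // the thesis; nothing here is FEAST's actual interface.
        #include <cmath>
        #include <cstddef>
        #include <vector>

        using VecD = std::vector<double>;
        using VecF = std::vector<float>;

        // Stand-in low-precision inner solver: a few Jacobi sweeps on A c = r.
        static VecF solve_single(const std::vector<VecF>& Af, const VecF& rf)
        {
            const std::size_t n = rf.size();
            VecF c(n, 0.0f), t(n);
            for (int s = 0; s < 50; ++s) {
                for (std::size_t i = 0; i < n; ++i) {
                    float sum = rf[i];
                    for (std::size_t j = 0; j < n; ++j)
                        if (j != i) sum -= Af[i][j] * c[j];
                    t[i] = sum / Af[i][i];
                }
                c.swap(t);
            }
            return c;
        }

        static VecD iterative_refinement(const std::vector<VecD>& A,
                                         const VecD& b, double tol, int max_iter)
        {
            const std::size_t n = b.size();
            std::vector<VecF> Af(n, VecF(n));  // demote the matrix once
            for (std::size_t i = 0; i < n; ++i)
                for (std::size_t j = 0; j < n; ++j) Af[i][j] = float(A[i][j]);

            VecD x(n, 0.0);
            for (int it = 0; it < max_iter; ++it) {
                VecD r(n);                     // double precision residual
                double norm = 0.0;
                for (std::size_t i = 0; i < n; ++i) {
                    double s = b[i];
                    for (std::size_t j = 0; j < n; ++j) s -= A[i][j] * x[j];
                    r[i] = s; norm += s * s;
                }
                if (std::sqrt(norm) < tol) break;      // converged in double
                VecF rf(n);
                for (std::size_t i = 0; i < n; ++i) rf[i] = float(r[i]);
                const VecF cf = solve_single(Af, rf);  // cheap correction
                for (std::size_t i = 0; i < n; ++i) x[i] += double(cf[i]);
            }
            return x;
        }

    The point of this structure is that the expensive inner solve touches only single precision data (roughly half the memory traffic), while the outer loop restores double precision accuracy, matching the efficiency argument made in the thesis.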

    Lecture 06: The Impact of Computer Architectures on the Design of Algebraic Multigrid Methods

    Get PDF
    Algebraic multigrid (AMG) is a popular iterative solver and preconditioner for large sparse linear systems. When designed well, it is algorithmically scalable, enabling it to solve increasingly larger systems efficiently. While it consists of various highly parallel building blocks, the original method also contained several highly sequential components. A large amount of research has been performed over several decades to design new components that perform well on high-performance computers. Indeed, AMG has been shown to scale well to more than a million processes. However, with single-core speeds plateauing, future increases in computing performance need to rely on more complicated, often heterogeneous computer architectures, which pose new challenges for efficient implementations of AMG. To meet these challenges and achieve fast and efficient performance, solvers need to exhibit extreme levels of parallelism and minimize data movement. In this talk, we will give an overview of how AMG has been impacted by the various architectures of high-performance computers to date and discuss our current efforts to continue to achieve good performance on emerging computer architectures

    Hybrid multigrid methods for high-order discontinuous Galerkin discretizations

    Full text link
    The present work develops hybrid multigrid methods for high-order discontinuous Galerkin discretizations of elliptic problems. Fast matrix-free operator evaluation on tensor product elements is used to devise a computationally efficient PDE solver. The multigrid hierarchy exploits all possibilities of geometric, polynomial, and algebraic coarsening, targeting engineering applications on complex geometries. Additionally, a transfer from discontinuous to continuous function spaces is performed within the multigrid hierarchy. This not only further reduces the size of the coarse-grid problem, but also leads to a discretization most suitable for state-of-the-art algebraic multigrid methods applied as the coarse-grid solver. The relevant design choices for selecting optimal multigrid coarsening strategies among the various possibilities are discussed, with computational cost as the driving metric for algorithmic selections. We find that a transfer to a continuous function space at the highest polynomial degree (or on the finest mesh), followed by polynomial and geometric coarsening, shows the best overall performance. The success of this particular multigrid strategy is due to a significant reduction in iteration counts compared to a transfer from discontinuous to continuous function spaces at the lowest polynomial degree (or on the coarsest mesh). The coarsening strategy with transfer to a continuous function space on the finest level also leads to a multigrid algorithm that is robust with respect to the penalty parameter of the SIPG method. Detailed numerical investigations are conducted for a series of examples ranging from academic test cases to more complex, practically relevant geometries. Performance comparisons to state-of-the-art methods from the literature demonstrate the versatility and computational efficiency of the proposed multigrid algorithms
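
    To make the preferred coarsening strategy concrete, the sketch below enumerates the resulting level sequence: a DG-to-CG transfer at the highest polynomial degree on the finest mesh, followed by polynomial coarsening and then geometric coarsening, with algebraic multigrid serving as the coarse-grid solver. The data structure and names are illustrative assumptions, not the authors' interface.

        // Minimal sketch enumerating the level hierarchy the paper favours:
        // DG-to-CG transfer at the highest degree on the finest mesh, then
        // polynomial coarsening (halving p), then geometric coarsening, with
        // AMG on the coarsest level. Level and build_hierarchy are
        // illustrative names, not the authors' interface.
        #include <iostream>
        #include <string>
        #include <vector>

        struct Level {
            std::string space;  // "DG" or "CG"
            int degree;         // polynomial degree p
            int refinement;     // geometric refinement level l
        };

        std::vector<Level> build_hierarchy(int p_fine, int l_fine)
        {
            std::vector<Level> levels;
            levels.push_back({"DG", p_fine, l_fine});  // fine DG level
            levels.push_back({"CG", p_fine, l_fine});  // DG -> CG transfer
            for (int p = p_fine / 2; p >= 1; p /= 2)   // p-coarsening
                levels.push_back({"CG", p, l_fine});   // (p a power of two)
            for (int l = l_fine - 1; l >= 0; --l)      // h-coarsening
                levels.push_back({"CG", 1, l});
            return levels;  // the last level is handed to an AMG solver
        }

        int main()
        {
            for (const Level& lv : build_hierarchy(/*p_fine=*/8, /*l_fine=*/3))
                std::cout << lv.space << "  p=" << lv.degree
                          << "  l=" << lv.refinement << '\n';
        }

    Running the sketch prints the eight levels of the example hierarchy, from DG with p = 8 on the finest mesh down to continuous linear elements on the coarsest mesh.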