
    Massively Parallel Oil Reservoir Simulation for History Matching


    Automatic Code Generation for Massively Parallel Applications in Numerical Fluid Mechanics

    Solving partial differential equations (PDEs) is a fundamental challenge in many application domains in industry and academia alike. With increasingly large problems, efficient and highly scalable implementations become more and more crucial. Today, facing this challenge is more difficult than ever due to the increasingly heterogeneous hardware landscape. One promising approach is developing domain-specific languages (DSLs) for a set of applications. Code generation techniques then allow targeting a range of hardware platforms while concurrently applying domain-specific optimizations in an automated fashion. The present work aims to further the state of the art in this field. As our domain, we choose PDE solvers and, in particular, those from the group of geometric multigrid methods. To avoid too broad a focus, we restrict ourselves to methods working on structured and patch-structured grids. We face the challenge of handling a domain as complex as ours, while providing different abstractions for diverse user groups, by splitting our external DSL ExaSlang into multiple layers, each specifying different aspects of the final application. Layer 1 is designed to resemble LaTeX and allows inputting continuous equations and functions. Their discretization is expressed on layer 2. It is complemented by algorithmic components, which can be implemented in a Matlab-like syntax on layer 3. All information provided to this point is summarized on layer 4, enriched with particulars about data structures and the employed parallelization. Additionally, we support automated progression between the different layers. All ExaSlang input is processed by our jointly developed Scala code generation framework to ultimately emit C++ code. We particularly focus on how to generate applications parallelized with, e.g., MPI and OpenMP that are able to run on workstations and large-scale clusters alike.
We showcase the applicability of our approach by implementing simple test problems, like Poisson's equation, as well as relevant applications from the field of computational fluid dynamics (CFD). In particular, we implement scalable solvers for the Stokes, Navier-Stokes and shallow water equations (SWE) discretized using finite differences (FD) and finite volumes (FV). For the case of Navier-Stokes, we also extend our implementation towards non-uniform grids, thereby enabling static mesh refinement, and advanced effects such as the simulated fluid being non-Newtonian and non-isothermal.
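The kind of geometric multigrid solver such a framework generates can be sketched as follows. This is a minimal, hand-written two-grid V-cycle for the 1-D Poisson problem -u'' = f on a structured grid, purely for illustration; it is not ExaSlang output, and the smoother choices (weighted Jacobi, 3 sweeps) are assumptions, not the thesis's configuration.

```python
def smooth(u, f, h, sweeps=3, omega=2/3):
    """Weighted-Jacobi smoothing for -u'' = f with Dirichlet boundaries."""
    n = len(u)
    for _ in range(sweeps):
        new = u[:]
        for i in range(1, n - 1):
            new[i] = (1 - omega) * u[i] + omega * 0.5 * (u[i-1] + u[i+1] + h*h*f[i])
        u = new
    return u

def residual(u, f, h):
    """r = f - A u for the standard 3-point finite-difference Laplacian."""
    n = len(u)
    r = [0.0] * n
    for i in range(1, n - 1):
        r[i] = f[i] - (2*u[i] - u[i-1] - u[i+1]) / (h*h)
    return r

def restrict(r):
    """Full-weighting restriction to the coarse grid (every other point)."""
    return [0.0] + [0.25*r[2*i-1] + 0.5*r[2*i] + 0.25*r[2*i+1]
                    for i in range(1, len(r)//2)] + [0.0]

def prolong(e, n_fine):
    """Linear interpolation of the coarse-grid correction to the fine grid."""
    u = [0.0] * n_fine
    for i in range(1, len(e) - 1):
        u[2*i] = e[i]
    for i in range(1, n_fine - 1, 2):
        u[i] = 0.5 * (u[i-1] + u[i+1])
    return u

def two_grid(u, f, h):
    """One V-cycle: pre-smooth, coarse-grid correction, post-smooth."""
    u = smooth(u, f, h)
    r = restrict(residual(u, f, h))
    # Many smoothing sweeps stand in for a direct coarse-grid solve.
    e = smooth([0.0] * len(r), r, 2*h, sweeps=100)
    u = [a + b for a, b in zip(u, prolong(e, len(u)))]
    return smooth(u, f, h)
```

In a full multigrid hierarchy the coarse solve would itself recurse; the layered DSL approach described above lets this algorithmic structure (layer 3) be specified independently of its discretization (layer 2) and parallelization (layer 4).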

    The development of a new preconditioner by modifying the simply sparse compression matrix to solve electromagnetic method of moments problems

    The aim of this research was to improve the matrix solution methods for method-of-moments (MoM) problems in SuperNEC, an electromagnetic simulation software package used to model antennas, and to develop a new preconditioner for the iterative method BICGSTAB(L). This was achieved by firstly implementing the ATLAS BLAS library, optimised for a specific computer architecture. The ATLAS code primarily makes use of code generation to build and optimise applications. Comparisons show that matrix solution times using LU decomposition optimised by ATLAS improve by a factor of 4.1 to 4.6, providing a good coding platform from which to compare other techniques. Secondly, the BICGSTAB iterative solution method in SuperNEC was improved by making use of an alternative algorithm, BICGSTAB(L). Systems of equations that converged slowly or not at all using BICGSTAB converged more quickly when using BICGSTAB(L) with L set to 4, despite the high condition numbers of the coefficient matrices. Thirdly, a domain decomposition method, Simply Sparse, was characterised. Investigations showed that Simply Sparse is a good compression technique for SuperNEC MoM matrices. The custom Simply Sparse solver also solves large matrix problems more quickly than LU decomposition and scales well with increased problem size. LU decomposition is, however, still quicker for problems smaller than 7000 unknowns, as the overhead of compressing the coefficient matrices dominates the Simply Sparse method for small problems. Lastly, a new preconditioner for BICGSTAB(L) was developed using a modified form of the Simply Sparse matrix. This was achieved by considering the Simply Sparse matrix to be equivalent to the full coefficient matrix [A]. The largest 1% to 2% of the Simply Sparse elements were selected to form the basis of the preconditioning matrix. These elements were further modified by multiplying them by a large constant, i.e. 1×10^7. The system of equations was then solved using BICGSTAB(L) with L set to 4.
The new preconditioned BICGSTAB(L) algorithm is quicker than both LU decomposition and the custom Simply Sparse solution method for problems larger than 5000 unknowns.
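The construction of such a preconditioner can be sketched as follows. This is not the SuperNEC code: the function name and the dense-matrix input are invented for illustration, while the keep-the-largest-1% idea and the large scaling constant (1×10^7) come from the text above.

```python
def sparsify_preconditioner(A, keep_fraction=0.01, scale=1e7):
    """Build a sparse preconditioning matrix from the largest-magnitude
    entries of a dense coefficient matrix A (list of lists).

    Returns a dict-of-keys sparse matrix {(i, j): value}, where the kept
    entries are multiplied by a large constant, in the spirit of the
    modified Simply Sparse preconditioner described above."""
    n = len(A)
    # Rank all entries by magnitude, largest first.
    entries = sorted(((abs(A[i][j]), i, j) for i in range(n) for j in range(n)),
                     reverse=True)
    keep = max(1, int(keep_fraction * n * n))
    return {(i, j): A[i][j] * scale for _, i, j in entries[:keep]}
```

In practice the selection would be made on the compressed (Simply Sparse) representation rather than a dense matrix, and the resulting operator would be applied inside each BICGSTAB(L) iteration; this sketch only shows the thresholding step.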

    Doctor of Philosophy

    Emerging trends such as growing architectural diversity and increased emphasis on energy and power efficiency motivate the need for code that adapts to its execution context (input dataset and target architecture). Unfortunately, writing such code remains difficult, and is typically attempted only by a small group of motivated expert programmers who are highly knowledgeable about the relationship between software and its hardware mapping. In this dissertation, we introduce novel abstractions and techniques based on automatic performance tuning that enable both experts and nonexperts (application developers) to produce adaptive code. We present two new frameworks for adaptive programming: Nitro and Surge. Nitro enables expert programmers to specify code variants, or alternative implementations of the same computation, together with meta-information for selecting among them. It then utilizes supervised classification to select an optimal code variant at runtime based on characteristics of the execution context. Surge, on the other hand, provides a high-level nested data-parallel programming interface for application developers to specify computations. It then employs a two-level mechanism to automatically generate code variants and tunes them using Nitro. The resulting code performs on par with or better than handcrafted reference implementations on both CPUs and GPUs. In addition to abstractions for expressing code variants, this dissertation also presents novel strategies for adaptively tuning them. First, we introduce a technique for dynamically selecting an optimal code variant at runtime based on characteristics of the input dataset. On five high-performance GPU applications, variants tuned using this strategy achieve over 93% of the performance of variants selected through exhaustive search.
Next, we present a novel approach based on multitask learning to develop a code variant selection model on a target architecture from training on different source architectures. We evaluate this approach on a set of six benchmark applications and a collection of six NVIDIA GPUs from three distinct architecture generations. Finally, we implement support for combined code variant and frequency selection based on multiple objectives, including power and energy efficiency. Using this strategy, we construct a GPU sorting implementation that provides improved energy and power efficiency with less than a proportional drop in sorting throughput.
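The variant-selection idea can be sketched as follows. This is not the Nitro API: the variant names, the nonzero-fraction feature, and the 1-nearest-neighbour model are invented for the example; Nitro itself uses supervised classification over execution-context features in the same spirit.

```python
def sort_dense(xs):
    """Variant tuned for data with few zeros: plain comparison sort."""
    return sorted(xs)

def sort_sparse(xs):
    """Variant tuned for data with many zeros (assumes nonnegative values):
    sort only the nonzeros, then prepend the zeros."""
    nz = sorted(x for x in xs if x != 0)
    return [0] * (len(xs) - len(nz)) + nz

VARIANTS = {"dense": sort_dense, "sparse": sort_sparse}

# Training data: (feature vector, best variant label) pairs, as would be
# gathered offline by timing each variant on representative inputs.
TRAINING = [((0.05,), "sparse"), ((0.10,), "sparse"),
            ((0.80,), "dense"), ((0.95,), "dense")]

def features(xs):
    """Execution-context feature: fraction of nonzero elements."""
    return (sum(1 for x in xs if x != 0) / len(xs),)

def select_variant(xs):
    """1-nearest-neighbour classification over the training set."""
    f = features(xs)
    _, label = min(TRAINING, key=lambda t: abs(t[0][0] - f[0]))
    return VARIANTS[label]

def adaptive_sort(xs):
    """Dispatch to the variant predicted best for this input."""
    return select_variant(xs)(xs)
```

The multitask-learning extension described above would reuse such a model across architectures, training the selection function on source GPUs and transferring it to a target GPU.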

    A Parallel Implementation of the Glowinski-Pironneau Algorithm for the Modified Stokes Problem

    In this dissertation we consider a parallel implementation of the Glowinski-Pironneau algorithm for the modified Stokes problem. In particular, we motivate this effort by demonstrating the occurrence of the modified Stokes problem in the time-dependent viscoelastic Oldroyd flow setting using Saramito's splitting. We then present an analysis of the Glowinski-Pironneau pressure decomposition for the modified Stokes problem, including numerical error estimates. Next we discuss our parallel finite element method implementation of the pressure decomposition approach. Finally, we present numerical results including errors and performance measures. These measures are also compared with results for a coupled velocity-pressure modified Stokes solver using a publicly available parallel solver.
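For orientation, the modified (generalized) Stokes problem that arises from such a time discretization has, in a common form, a zeroth-order velocity term added to the Stokes system. This is a sketch only; the exact coefficients depend on the particular splitting and model parameters, which the abstract does not specify:

```latex
% Find velocity u and pressure p such that
\begin{aligned}
  \alpha \mathbf{u} - \nu \Delta \mathbf{u} + \nabla p &= \mathbf{f}
    && \text{in } \Omega, \qquad (\alpha \sim 1/\Delta t > 0)\\
  \nabla \cdot \mathbf{u} &= 0 && \text{in } \Omega,\\
  \mathbf{u} &= \mathbf{0} && \text{on } \partial\Omega.
\end{aligned}
```

The reaction term $\alpha \mathbf{u}$ is what distinguishes the modified problem from the classical Stokes problem ($\alpha = 0$) and motivates revisiting the pressure decomposition's analysis.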

    Accelerating induction machine finite-element simulation with parallel processing

    Finite element analysis used for detailed electromagnetic analysis and design of electric machines is computationally intensive. A means of accelerating two-dimensional transient finite element analysis, required for induction machine modeling, is explored using graphical processing units (GPUs) for parallel processing. The graphical processing units, widely used for image processing, can provide faster computation times than CPUs alone due to the thousands of small processors that comprise them. Computations suitable for GPU parallel processing are those that can be decomposed into independent subsections, computed in parallel, and reassembled. The steps and components of the transient finite element simulation are analyzed to determine if using GPUs for calculations can speed up the simulation. The dominant steps of the finite element simulation are preconditioner formation, computation of the sparse iterative solution, and matrix-vector multiplication for the magnetic flux density calculation. Due to the sparsity of the finite element problem, GPU implementation of the sparse iterative solution did not result in faster computation times. The dominant speed-up achieved using the GPUs resulted from matrix-vector multiplication. Simulation results are presented for a benchmark nonlinear magnetic material transient eddy current problem and a linear magnetic material transient linear induction machine problem. The finite element analysis program is implemented with MATLAB R2014a to compare sparse matrix format computations to readily available GPU matrix and vector formats and Compute Unified Device Architecture (CUDA) functions linked to MATLAB. Overall, the hybrid CPU/GPU implementation computed the finite element solution 1.2-3.5 times faster than the CPU-only implementation.
The variation in speed-up depends on the sparsity and the number of unknowns of the problem.
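The kernel that dominated the speed-up, sparse matrix-vector multiplication, can be sketched as follows. This pure-Python CSR version is illustrative only (the thesis's implementation uses MATLAB gpuArray and CUDA functions, not this code); it shows the row-wise independence that makes the kernel map well to one GPU thread per row.

```python
def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A @ x with A stored in compressed-sparse-row (CSR) form.

    values  : nonzero entries of A, row by row
    col_idx : column index of each entry in `values`
    row_ptr : row_ptr[r]..row_ptr[r+1] delimit row r's entries

    Each row's dot product reads disjoint slices of `values` and writes a
    single y[r], so rows can be computed fully in parallel and reassembled."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y
```

The same row independence explains the reported result pattern: dense-enough rows give each GPU thread useful work, while a very sparse iterative solve leaves too little arithmetic per memory access for the GPU to pay off.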