34 research outputs found

Multilevel Parallelization: Grid Methods for Solving Direct and Inverse Problems

Authors: Chernykh I, Hildyard M, Krivorot’ko O, Kulikov I, Shishlenin M, Titarenko S, Voronov D
Publication venue: Springer Science and Business Media LLC
Publication date: 12/03/2017

In this paper we present grid methods that we have developed for solving direct and inverse problems, and their implementation with different levels of optimization. We focus on solving systems of hyperbolic equations using finite difference and finite volume numerical methods on multicore architectures. Several levels of parallelism have been applied: geometric decomposition of the computational domain, workload distribution over threads using OpenMP directives, and vectorization. The run-time efficiency of these methods has been investigated. These developments have been tested using the astrophysics code AstroPhi on the hybrid cluster Polytechnic RSC PetaStream (consisting of Intel Xeon Phi accelerators) and a geophysics (seismic wave) code on an Intel Core i7-3930K multicore processor. We present the results of the calculations and study MPI run-time energy efficiency.
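The geometric decomposition mentioned in the abstract can be illustrated with a minimal sketch. It assumes a 1D grid split into contiguous blocks with a one-cell ghost (halo) layer; the function name `decompose_1d` and its parameters are hypothetical and are not taken from AstroPhi or the paper.

```python
def decompose_1d(n_cells, n_ranks, ghost=1):
    """Split n_cells grid cells into n_ranks contiguous blocks.

    Hypothetical helper for illustration only. Returns one tuple per rank:
    (owned_start, owned_stop, halo_start, halo_stop), all half-open ranges.
    """
    if n_ranks <= 0 or n_cells < n_ranks:
        raise ValueError("need at least one cell per rank")
    base, extra = divmod(n_cells, n_ranks)
    blocks = []
    start = 0
    for rank in range(n_ranks):
        # The first `extra` ranks take one additional cell each.
        size = base + (1 if rank < extra else 0)
        stop = start + size
        # Ghost cells overlap the neighbours but never leave the global grid.
        halo_start = max(0, start - ghost)
        halo_stop = min(n_cells, stop + ghost)
        blocks.append((start, stop, halo_start, halo_stop))
        start = stop
    return blocks


if __name__ == "__main__":
    for rank, block in enumerate(decompose_1d(10, 3)):
        print(rank, block)
```

In a real solver each block would be owned by one MPI process, the ghost cells would be refreshed from neighbouring processes after every time step, and the loop over owned cells would be the part shared among threads and vectorized.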

White Rose Research Online

Accelerated Discontinuous Galerkin Solvers with the Chebyshev Iterative Method on the Graphics Processing Unit

Author: Tullius Toni Kathleen
Publication date: 01/01/2011

This work demonstrates implementations of the discontinuous Galerkin (DG) method on graphics processing units (GPUs), which deliver improved computational time compared to the conventional central processing unit (CPU). The linear system that arises when applying the DG method to an elliptic problem is solved on the GPU. The conjugate gradient (CG) method and the Chebyshev iterative method are the linear system solvers compared, to see which is more efficient when computing with the GPU's parallel architecture. With both methods, computational times decreased for large problems executed on the GPU compared to the CPU; however, CG is more efficient than the Chebyshev iterative method. In addition, a constant-free upper bound for the DG spectrum applied to the elliptic problem is developed. Few previous works combine the DG method and the GPU. This thesis will provide useful guidelines for the numerical solution of elliptic problems using DG on the GPU.
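For readers unfamiliar with the Chebyshev iterative method, the following is a minimal CPU-only sketch of the textbook iteration for a symmetric positive definite system. It assumes that bounds `lam_min < lam_max` on the spectrum are known in advance, which is where a spectral bound such as the one mentioned in the abstract would be used. The function names are illustrative and this is not the thesis implementation.

```python
def matvec(A, x):
    """Dense matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def chebyshev_solve(A, b, lam_min, lam_max, iters=50):
    """Approximate the solution of A x = b with the Chebyshev iteration.

    Assumes A is symmetric positive definite and that its eigenvalues lie
    in [lam_min, lam_max] with lam_min < lam_max. Starts from x = 0.
    """
    theta = (lam_max + lam_min) / 2.0   # centre of the spectral interval
    delta = (lam_max - lam_min) / 2.0   # half-width of the interval
    sigma1 = theta / delta
    rho = 1.0 / sigma1
    x = [0.0] * len(b)
    r = list(b)                         # residual b - A x for x = 0
    d = [ri / theta for ri in r]        # first update direction
    for _ in range(iters):
        x = [xi + di for xi, di in zip(x, d)]
        Ad = matvec(A, d)
        r = [ri - adi for ri, adi in zip(r, Ad)]
        rho_next = 1.0 / (2.0 * sigma1 - rho)
        d = [rho_next * rho * di + (2.0 * rho_next / delta) * ri
             for di, ri in zip(d, r)]
        rho = rho_next
    return x


if __name__ == "__main__":
    print(chebyshev_solve([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0], 2.0, 5.0))
```

Unlike CG, the loop contains no inner products, so each iteration needs no global reduction. That property is a common reason to consider the method on parallel hardware, although the abstract reports that CG was still the more efficient solver in this study.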

DSpace at Rice University

Numerical enhancements and parallel GPU implementation of the TRACEO3D model

Author: Calazan Rogério de Moraes
Publication date: 01/01/2018

Underwater acoustic models provide a fundamental and efficient tool to parametrically investigate hypotheses and physical phenomena through varied environmental conditions of sound propagation underwater. In this sense, requirements for model predictions in a three-dimensional ocean waveguide are expected to become more relevant, and thus expected to become more accurate as the amount of available environmental information (water temperature, bottom properties, etc.) grows. However, despite the increasing performance of modern processors, models that take into account 3D propagation still have a high computational cost, which often hampers the usage of such models. Thus, the work presented in this thesis investigates a solution to enhance the numerical and computational performance of the TRACEO3D Gaussian beam model, which is able to handle full three-dimensional propagation. In this context, the development of a robust method for 3D eigenray search is addressed, which is fundamental for the calculation of a channel impulse response. A remarkable aspect of the search strategy was its ability to provide accurate values of initial eigenray launching angles, even when dealing with nonlinearity induced by the complex propagation regime of rays bouncing on the boundaries. In the same way, an optimized method for pressure field calculation is presented that accounts for a large number of sensors. These numerical enhancements and the optimization of the sequential version of TRACEO3D led to significant improvements in its performance and accuracy. Furthermore, the present work considered the development of parallel algorithms to take advantage of the GPU architecture, looking carefully at the inherent parallelism of ray tracing and the high workload of predictions for 3D propagation. The combination of numerical enhancements and parallelization aimed to achieve the highest performance of TRACEO3D.
An important aspect of this research is that validation and performance assessment were carried out not only for idealized waveguides, but also for the experimental results of a tank scale experiment. The results will demonstrate that a remarkable performance was achieved without compromising accuracy. It is expected that the contributions and the remarkable reduction in runtime achieved will help to overcome some of the reservations about employing a 3D model for predictions of acoustic fields.

Sapientia

Doctor of Philosophy

Author: Earl Christopher
Publication venue: University of Utah
Publication date: 01/05/2014

In the static analysis of functional programs, control-flow analysis (k-CFA) is a classic method of approximating program behavior as a finite-state automaton. CFA2 and abstract garbage collection are two recent, yet orthogonal, improvements on k-CFA. CFA2 approximates program behavior as a pushdown system, using summarization for the stack. CFA2 can accurately approximate arbitrarily deep recursive function calls, whereas k-CFA cannot. Abstract garbage collection removes unreachable values from the store/heap. If unreachable values are not removed from a static analysis, they can become reachable again, which pollutes the final analysis and makes it less precise. Unfortunately, as these two techniques were originally formulated, they are incompatible. CFA2's summarization technique for managing the stack obscures the stack such that abstract garbage collection is unable to examine the stack for reachable values. This dissertation presents introspective pushdown control-flow analysis, which manages the stack explicitly through stack changes (pushes and pops). Because this analysis is able to examine the stack by how it has changed, abstract garbage collection is able to examine the stack for reachable values. Thus, introspective pushdown control-flow analysis successfully merges the benefits of CFA2 and abstract garbage collection to create a more precise static analysis. Additionally, the high-performance computing community has viewed functional programming techniques and tools as lacking the efficiency necessary for their applications. Nebo is a declarative domain-specific language embedded in C++ for discretizing partial differential equations for transport phenomena. For efficient execution, Nebo exploits a version of expression templates, based on the C++ template system, which is a type-less, completely pure, Turing-complete functional language with burdensome syntax.
Nebo's declarative syntax supports functional tools, such as point-wise lifting of complex expressions and functional composition of stencil operators. Nebo's primary abstraction is mathematical assignment, which separates what a calculation does from how that calculation is executed. Currently Nebo supports single-core execution, multicore (thread-based) parallel execution, and GPU execution. With single-core execution, Nebo performs on par with the loops and code that it replaces in Wasatch, a pre-existing high-performance simulation project. With multicore (thread-based) execution, Nebo can scale linearly (with roughly 90% efficiency) up to 6 processors, compared to its single-core execution. Moreover, Nebo's GPU execution can be up to 37x faster than its single-core execution. Finally, Wasatch (the pre-existing high-performance simulation project which uses Nebo) can scale up to 262K cores.

The University of Utah: J. Willard Marriott Digital Library

Locality Enhancement and Dynamic Optimizations on Multi-Core and GPU

Author: Zhang Zheng
Publication venue: W&M ScholarWorks
Publication date: 01/01/2012

Enhancing the match between software executions and hardware features is key to computing efficiency. The match is a continuously evolving and challenging problem. This dissertation focuses on the development of programming system support for exploiting two key features of modern hardware development: the massive parallelism of emerging computational accelerators such as Graphics Processing Units (GPUs), and the non-uniformity of cache sharing in modern multicore processors. They are respectively driven by the important role of accelerators in today's general-purpose computing and the ultimate importance of memory performance. This dissertation particularly concentrates on optimizing control flows and memory references, at both compilation and execution time, to tap into the full potential of pure software solutions in taking advantage of the two key hardware features.
Conditional branches cause divergences in program control flows, which may result in serious performance degradation on massively data-parallel GPU architectures with Single Instruction Multiple Data (SIMD) parallelism. On such an architecture, control divergence may force computing units to stay idle for a substantial time, throttling system throughput by orders of magnitude. This dissertation provides an extensive exploration of the solution to this problem and presents program-level transformations based upon two fundamental techniques: thread relocation and data relocation. These two optimizations provide fundamental support for swapping jobs among threads so that the control flow paths of threads converge within every SIMD thread group.
In memory performance, this dissertation concentrates on two aspects: the influence of non-uniform sharing on multithreading applications, and the optimization of irregular memory references on GPUs. In shared-cache multicore chips, interactions among threads are complicated due to the interplay of cache contention and synergistic prefetching. This dissertation presents the first systematic study on the influence of non-uniform shared cache on contemporary parallel programs, reveals the mismatch between the software development and underlying cache sharing hierarchies, and further demonstrates it by proposing and applying cache-sharing-aware data transformations that bring significant performance improvement. For the second aspect, the efficiency of GPU accelerators is sensitive to irregular memory references, that is, memory references whose access patterns remain unknown until execution time (e.g., A[P[i]]). The root causes of the irregular memory reference problem are similar to those of the control flow problem, but in a more general and complex form. I developed a framework, named G-Streamline, as a unified software solution to dynamic irregularities in GPU computing. It treats both types of irregularities at the same time in a holistic fashion, maximizing the whole-program performance by resolving conflicts among optimizations.
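To make the irregular reference A[P[i]] concrete, here is a minimal sketch that counts how many memory segments each SIMD thread group touches before and after jobs are swapped among threads. The cost model (one transaction per distinct segment per group), the function names, and the sort-based reordering are simplifying assumptions for illustration; this is not the G-Streamline algorithm.

```python
def segments_touched(indices, warp_size=32, segment_size=32):
    """Sum, over thread groups, the number of distinct segments accessed.

    `indices[t]` is the element of A that thread t reads, i.e. P[t].
    A lower total means the accesses of each group are better clustered.
    """
    total = 0
    for first in range(0, len(indices), warp_size):
        group = indices[first:first + warp_size]
        total += len({i // segment_size for i in group})
    return total


def reorder_by_index(P):
    """Return a job order in which neighbouring threads read nearby elements.

    Hypothetical stand-in for job swapping: thread k takes job order[k].
    """
    return sorted(range(len(P)), key=lambda job: P[job])


if __name__ == "__main__":
    # Strided pattern: every group of 4 threads touches 4 different segments.
    P = [0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15]
    order = reorder_by_index(P)
    before = segments_touched(P, warp_size=4, segment_size=4)
    after = segments_touched([P[j] for j in order], warp_size=4, segment_size=4)
    print(before, after)
```

The sketch ignores the cost of computing and applying the reordering at run time, which a practical system has to weigh against the memory savings.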

College of William & Mary: W&M Publish

A robust high-resolution hydrodynamic numerical model for surface water flow and transport processes within a flexible software framework

Author: Simons Franz
Publication venue: Technische Universität Berlin
Publication date: 01/01/2020

Parallel title (German): Ein robustes hochauflösendes hydrodynamisch-numerisches Modell für Oberflächenabfluss- und Transportprozesse innerhalb eines flexiblen Software-Framework

Hydraulic Engineering Repository

Accelerating Dynamical Density Response Code on Summit and Its Application for Computing the Density Response Function of Vanadium Sesquioxide

Author: Phan Wileam Y
Publication venue: TRACE: Tennessee Research and Creative Exchange
Publication date: 01/12/2021

This thesis details the process of porting the Eguiluz group dynamical density response computational platform to the hybrid CPU+GPU environment of the Summit supercomputer at the Oak Ridge National Laboratory (ORNL) Leadership Computing Center. The baseline CPU-only version is a Gordon Bell-winning platform within the formally exact time-dependent density functional theory (TD-DFT) framework using the linearized augmented plane wave (LAPW) basis set. The code is accelerated using a combination of the OpenACC programming model and GPU libraries, namely the Matrix Algebra for GPU and Multicore Architectures (MAGMA) library, as well as by exploiting the sparsity pattern of the matrices involved in the matrix-matrix multiplication. Benchmarks show a 12.3x speedup compared to the CPU-only version. This performance boost should accelerate discovery in materials and condensed matter physics through computational means. After the hybrid CPU+GPU code has been sufficiently optimized, it is used to study the dynamical density response function of vanadium sesquioxide, and the results are compared with spectroscopic data from non-resonant inelastic X-ray scattering (NIXS) experiments.

University of Tennessee, Knoxville: Trace

Photorealistic physically based render engines: a comparative study

Author: Pérez Roig Francisco
Publication venue: Universitat Politecnica de Valencia
Publication date: 27/02/2012

Pérez Roig, F. (2012). Photorealistic physically based render engines: a comparative study. http://hdl.handle.net/10251/14797

RiuNet

Modelling Fluid Structure Interaction problems using Boundary Element Method

Author: Giuliani Nicola
Publication venue: Trieste
Publication date: 29/09/2017

This dissertation investigates the application of Boundary Element Methods (BEM) to Fluid Structure Interaction (FSI) problems from three main perspectives. This work is divided into three main parts: i) the derivation of a BEM for the Laplace equation and its application to analyze ship-wave interaction problems, ii) the implementation of efficient and parallel BEM solvers addressing the newest challenges of High Performance Computing, iii) the development of a BEM for the Stokes system and its application to study micro-swimmers.
First we develop a BEM for the Laplace equation and apply it to predict ship-wave interactions, making use of an innovative coupling with Finite Element Method stabilization techniques. As is well known, the wave pattern around a body depends on the Froude number associated with the flow. Thus, we thoroughly investigate the robustness and accuracy of the developed methodology, assessing the dependence of the solution on this parameter. To improve the performance and tackle problems with a higher number of unknowns, the BEM developed for the Laplace equation is parallelized using open-source techniques in a hybrid distributed-shared memory environment. We perform several tests to demonstrate both the accuracy and the performance of the parallel BEM developed. In addition, we explore two different possibilities to reduce the overall computational cost from O(N^2) to O(N). Firstly, we couple the library with a Fast Multipole Method that allows us to reach higher orders of complexity and efficiency. Then we perform a preliminary study on the implementation of a parallel Non Uniform Fast Fourier Transform to be coupled with the newly developed Sparse Cardinal Sine Decomposition (SCSD) algorithm.
Finally, we consider the application of the BEM framework to a different kind of FSI problem, represented by the Stokes flow of a liquid medium surrounding swimming micro-organisms. We maintain the parallel structure derived for the Laplace equation even in the Stokes setting. Our implementation is able to simulate both prokaryotic and eukaryotic organisms, matching literature and experimental benchmarks. We finally present a deep analysis of the importance of hydrodynamic interactions between the different parts of micro-swimmers in the prediction of optimal swimming conditions, focusing our attention on the study of flagellated “robotic” composite swimmers.

Sissa Digital Library