9 research outputs found
A scalable, efficient scheme for evaluation of stencil computations over unstructured meshes
pre-print
Stencil computations are a common class of operations that appear in many computational science and engineering applications. Stencil computations often benefit from compile-time analysis, data-locality optimizations, and parallelism. Post-processing of discontinuous Galerkin (dG) simulation solutions with B-spline kernels is an example of a numerical method which requires evaluating computationally intensive stencil operations over a mesh. Previous work on stencil computations has focused on structured meshes, while giving little attention to unstructured meshes. Performing stencil operations over an unstructured mesh requires sampling of heterogeneous elements, which often leads to inefficient memory access patterns and limits data locality/reuse. In this paper, we present an efficient method for performing stencil computations over unstructured meshes which increases data locality and cache efficiency, and a scalable approach for stencil tiling and concurrent execution. We provide experimental results in the context of post-processing of dG solutions that demonstrate the effectiveness of our approach.
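The paper's actual tiling and reordering scheme is not reproduced in this abstract. As a rough illustration of the setting only (a hypothetical sketch, not the authors' implementation), a gather-style stencil over an unstructured mesh can be stored in CSR-like adjacency form:

```python
import numpy as np

def stencil_apply(values, offsets, neighbors, weights):
    """Gather-style stencil over an unstructured mesh in CSR-like form.
    offsets[i]:offsets[i+1] delimits the neighbor/weight lists of element i."""
    out = np.empty_like(values)
    for i in range(len(values)):
        lo, hi = offsets[i], offsets[i + 1]
        # Irregular gather: neighbors need not be contiguous in memory.
        out[i] = np.dot(weights[lo:hi], values[neighbors[lo:hi]])
    return out

# Tiny example: a 4-element periodic ring, 3-point averaging stencil.
offsets = np.array([0, 3, 6, 9, 12])
neighbors = np.array([3, 0, 1,  0, 1, 2,  1, 2, 3,  2, 3, 0])
weights = np.full(12, 1.0 / 3.0)
```

The indirect access `values[neighbors[lo:hi]]` is precisely where unstructured meshes lose data locality; the paper's contribution lies in reordering and tiling such gathers so that repeated accesses hit cache.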
Dynamic earthquake rupture modelled with an unstructured 3-D spectral element method applied to the 2011 M9 Tohoku earthquake
An important goal of computational seismology is to simulate dynamic earthquake rupture and strong ground motion in realistic models that include crustal heterogeneities and complex fault geometries. To accomplish this, we incorporate dynamic rupture modelling capabilities in a spectral element solver on unstructured meshes, the 3-D open source code SPECFEM3D, and employ state-of-the-art software for the generation of unstructured meshes of hexahedral elements. These tools provide high flexibility in representing fault systems with complex geometries, including faults with branches and non-planar faults. The domain size is extended with progressive mesh coarsening to maintain an accurate resolution of the static field. Our implementation of dynamic rupture does not affect the parallel scalability of the code. We verify our implementation by comparing our results to those of two finite element codes on benchmark problems including branched faults. Finally, we present a preliminary dynamic rupture model of the 2011 M_w 9.0 Tohoku earthquake including a non-planar plate interface with heterogeneous frictional properties and initial stresses. Our simulation qualitatively reproduces the depth-dependent frequency content of the source and the large slip close to the trench observed for this earthquake.
Fast GPU-Based Seismogram Simulation From Microseismic Events in Marine Environments Using Heterogeneous Velocity Models
A novel approach is presented for fast generation of synthetic seismograms due to microseismic events, using heterogeneous marine velocity models. The partial differential equations (PDEs) for the 3D elastic wave equation have been numerically solved using the Fourier-domain pseudo-spectral method, which is parallelizable on graphics processing unit (GPU) cards, thus making it faster compared to traditional CPU-based computing platforms. Because forward simulation of large geological models is computationally expensive, several combinations of individual synthetic seismic traces are used for specified microseismic event locations, in order to simulate the effect of realistic microseismic activity patterns in the subsurface. We here explore the patterns generated by a few hundred microseismic events with different source mechanisms using various combinations, both in event amplitudes and origin times, using the simulated pressure and three-component particle velocity fields via 1D, 2D and 3D seismic visualizations.
Shell Projects and Technology
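The abstract's 3-D elastic GPU solver is beyond a short example, but the Fourier pseudo-spectral idea it rests on can be illustrated in 1-D (NumPy on the CPU standing in for the paper's GPU FFTs; this is a generic sketch, not the authors' code):

```python
import numpy as np

def spectral_derivative(u, length):
    """Spatial derivative of a periodic field via the FFT:
    differentiation in x becomes multiplication by i*k in Fourier space."""
    n = len(u)
    # Angular wavenumbers matching numpy's FFT frequency ordering.
    k = 2.0 * np.pi * np.fft.fftfreq(n, d=length / n)
    return np.real(np.fft.ifft(1j * k * np.fft.fft(u)))
```

Every grid point's derivative depends on all points through the FFT, which is exactly the kind of regular, batched transform that maps well onto GPU hardware.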
AxiSEM: broadband 3-D seismic wavefields in axisymmetric media
We present a methodology to compute 3-D global seismic wavefields for realistic earthquake sources in visco-elastic anisotropic media, covering applications across the observable seismic frequency band with moderate computational resources. This is accommodated by mandating axisymmetric background models that allow for a multipole expansion such that only a 2-D computational domain is needed, whereas the azimuthal third dimension is computed analytically on the fly. This dimensional collapse opens doors for storing space–time wavefields on disk that can be used to compute Fréchet sensitivity kernels for waveform tomography. We use the corresponding publicly available AxiSEM (www.axisem.info) open-source spectral-element code, demonstrate its excellent scalability on supercomputers and a diverse range of applications, ranging from normal modes to small-scale lowermost mantle structures, tomographic models, and comparisons with observed data, and discuss further avenues to pursue with this methodology.
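A toy sketch of the dimensional collapse (hypothetical and heavily simplified, not AxiSEM's actual expansion): the solver carries only 2-D fields, one per azimuthal order m, and the azimuthal dependence is evaluated analytically. For illustration we keep cosine terms only, which assumes a field symmetric about phi = 0:

```python
import numpy as np

def field_3d(f_m, phi):
    """Evaluate the azimuthal dependence analytically from stored 2-D modes.
    f_m has shape (M, ...): the 2-D field for each azimuthal order m = 0..M-1
    (cosine terms only -- a simplifying symmetry assumption)."""
    m = np.arange(f_m.shape[0])
    # Sum over modes: f(phi) = sum_m f_m * cos(m * phi)
    return np.tensordot(np.cos(m * phi), f_m, axes=1)
```

The expensive part (the f_m fields) lives on a 2-D mesh; evaluating any azimuth phi costs only a short mode sum, which is why storing the wavefields on disk stays affordable.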
Forward and adjoint simulations of seismic wave propagation on emerging large-scale GPU architectures
Computational seismology is an area of wide sociological and economic impact, ranging from earthquake risk assessment to subsurface imaging and oil and gas exploration. At the core of these simulations is the modeling of wave propagation in a complex medium. Here we report on the extension of the high-order finite-element seismic wave simulation package SPECFEM3D to support the largest scale hybrid and homogeneous supercomputers. Starting from an existing highly tuned MPI code, we migrated to a CUDA version. In order to be of immediate impact to the science mission of computational seismologists, we had to port the entire production package, rather than just individual kernels. One of the challenges in parallelizing finite element codes is the potential for race conditions during the assembly phase. We therefore investigated different methods such as mesh coloring or atomic updates on the GPU. In order to achieve strong scaling, we needed to ensure good overlap of data motion at all levels, including internode and host-accelerator transfers. Finally we carefully tuned the GPU implementation. The new MPI/CUDA solver exhibits excellent scalability and achieves speedup on a node-to-node basis over the carefully tuned equivalent multi-core MPI solver. To demonstrate the performance of both the forward and adjoint functionality, we present two case studies run on the Cray XE6 CPU and Cray XK6 GPU architectures up to 896 nodes: (1) focusing on most commonly used forward simulations, we simulate seismic wave propagation generated by earthquakes in Turkey, and (2) testing the most complex seismic inversion type of the package, we use ambient seismic noise to image 3-D crust and mantle structure beneath western Europe. © 2012 IEEE
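Of the two strategies the abstract mentions for the assembly-phase race, mesh coloring can be sketched as follows (a generic greedy coloring for illustration, not SPECFEM3D's actual scheme): elements sharing a mesh node receive different colors, so all elements of one color can be assembled concurrently without atomics.

```python
def color_elements(elements):
    """Greedy coloring: elements sharing a mesh node get different colors,
    so each color class can be assembled in parallel without write races."""
    node_colors = {}  # node -> set of colors already used by elements at that node
    colors = []
    for elem in elements:
        used = set()
        for node in elem:
            used |= node_colors.setdefault(node, set())
        c = 0
        while c in used:  # smallest color not yet used at any shared node
            c += 1
        colors.append(c)
        for node in elem:
            node_colors[node].add(c)
    return colors
```

The GPU assembly then loops over colors, launching one race-free kernel per color; the trade-off against atomic updates is fewer conflicts at the cost of more, smaller kernel launches.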
PDE Solvers for Hybrid CPU-GPU Architectures
Many problems of scientific and industrial interest are investigated through numerically solving partial differential equations (PDEs). For some of these problems, the scope of the investigation is limited by the costs of computational resources. A new approach to reducing these costs is the use of coprocessors, such as graphics processing units (GPUs) and Many Integrated Core (MIC) cards, which can execute floating point operations at a higher rate than a central processing unit (CPU) of the same cost. This is achieved through the use of a large number of processors in a single device, each with very limited dedicated memory per thread. Codes for a number of continuum methods, such as boundary element methods (BEM), finite element methods (FEM) and finite difference methods (FDM) have already been implemented on coprocessor architectures. These methods were designed before the adoption of coprocessor architectures, so implementing them efficiently with reduced thread-level memory can be challenging. There are other methods that do operate efficiently with limited thread-level memory, such as Monte Carlo methods (MCM) and lattice Boltzmann methods (LBM) for kinetic formulations of PDEs, but they are not competitive on CPUs and generally have poorer convergence than the continuum methods. In this work, we introduce a class of methods in which the parallelism of kinetic formulations on GPUs is combined with the better convergence of continuum methods on CPUs. We first extend an existing Feynman-Kac formulation for determining the principal eigenpair of an elliptic operator to create a version that can retrieve arbitrarily many eigenpairs. This new method is implemented for multiple GPUs, and combined with a standard deflation preconditioner on multiple CPUs to create a hybrid concurrent method with superior convergence to that of the deflation preconditioner alone. 
The hybrid method exhibits good parallelism, with an efficiency of 80% on a problem with 300 million unknowns, run on a configuration of 324 CPU cores and 54 GPUs.
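The Feynman-Kac machinery is beyond a short example, but the deflation idea it is paired with can be sketched with plain power iteration as a stand-in (an illustrative toy, not the dissertation's method): once an eigenpair converges, its rank-1 contribution is subtracted so the iteration finds the next one.

```python
import numpy as np

def eigenpairs_by_deflation(A, k, iters=500):
    """Top-k eigenpairs of a symmetric matrix: power iteration plus deflation.
    After each eigenvector converges, subtract lam * v v^T so the next
    iteration converges to the next-largest eigenvalue."""
    n = A.shape[0]
    B = A.astype(float).copy()
    vals, vecs = [], []
    rng = np.random.default_rng(0)
    for _ in range(k):
        v = rng.standard_normal(n)
        for _ in range(iters):
            v = B @ v
            v /= np.linalg.norm(v)
        lam = v @ (B @ v)            # Rayleigh quotient for the converged vector
        vals.append(lam)
        vecs.append(v)
        B -= lam * np.outer(v, v)    # deflate the converged pair
    return np.array(vals), np.array(vecs)
```

For symmetric matrices the deflated operator keeps the remaining spectrum intact, which is the property the hybrid method exploits when extending a principal-eigenpair formulation to arbitrarily many eigenpairs.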
Doctor of Philosophy dissertation
Memory access irregularities are a major bottleneck for bandwidth-limited problems on Graphics Processing Unit (GPU) architectures. GPU memory systems are designed to allow consecutive memory accesses to be coalesced into a single memory access. Noncontiguous accesses within a parallel group of threads working in lock step may cause serialized memory transfers. Irregular algorithms may have data-dependent control flow and memory access, which require runtime information to be evaluated. Compile-time methods for evaluating parallelism, such as static dependence graphs, are not capable of evaluating irregular algorithms. The goals of this dissertation are to study irregularities within the context of unstructured mesh and sparse matrix problems, analyze the impact of vectorization widths on irregularities, and present data-centric methods that improve control flow and memory access irregularity within those contexts.
Reordering associative operations has often been exploited for performance gains in parallel algorithms. This dissertation presents a method for associative reordering of stencil computations over unstructured meshes that increases data reuse through caching. This novel parallelization scheme offers considerable speedups over standard methods.
Vectorization widths can have a significant impact on performance in vectorized computations. Although the hardware vector width is generally fixed, the logical vector width used within a computation can range from one up to the width of the computation. Significant performance differences can occur due to thread scheduling and resource limitations. This dissertation analyzes the impact of vectorization widths on dense numerical computations such as 3D dG postprocessing.
It is difficult to efficiently perform dynamic updates on traditional sparse matrix formats. Explicitly controlling memory segmentation allows for in-place dynamic updates in sparse matrices.
Dynamically updating the matrix without rebuilding or sorting greatly improves processing time and overall throughput. This dissertation presents a new sparse matrix format, dynamic compressed sparse row (DCSR), which allows for dynamic streaming updates to a sparse matrix. A new method for parallel sparse matrix-matrix multiplication (SpMM) that uses dynamic updates is also presented.
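DCSR itself is defined in the dissertation; the underlying idea, giving each row its own memory segment with slack so a new nonzero lands in place instead of forcing a global rebuild, can be caricatured as follows (an illustrative toy, not the actual DCSR layout):

```python
import numpy as np

class PaddedCSR:
    """Toy sketch of in-place dynamic sparse updates: each row owns a
    fixed-capacity segment, so inserting a nonzero never shifts other rows.
    (Illustrative only -- the dissertation's DCSR uses a different layout.)"""

    def __init__(self, n_rows, capacity):
        self.cap = capacity
        self.cols = np.full(n_rows * capacity, -1, dtype=int)
        self.vals = np.zeros(n_rows * capacity)
        self.row_len = np.zeros(n_rows, dtype=int)

    def add(self, r, c, v):
        base = r * self.cap
        for j in range(self.row_len[r]):       # accumulate duplicate entries
            if self.cols[base + j] == c:
                self.vals[base + j] += v
                return
        j = self.row_len[r]
        assert j < self.cap, "row segment full; a real DCSR would chain a new segment"
        self.cols[base + j] = c
        self.vals[base + j] = v
        self.row_len[r] = j + 1

    def row(self, r):
        base = r * self.cap
        n = self.row_len[r]
        return dict(zip(self.cols[base:base + n], self.vals[base:base + n]))
```

Classic CSR would have to shift every later row's data (or re-sort) on insertion; segmenting memory per row is what makes streaming updates cheap.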
Flexible Model Extension and Optimization of Earthquake Simulations (Flexible Modellerweiterung und Optimierung von Erdbebensimulationen)
Simulations of realistic earthquake scenarios require scalable software and extensive supercomputing resources. With increasing fidelity in simulations, advanced rheological and source models need to be incorporated. I introduce a domain-specific language in order to handle the model flexibility in combination with the high efficiency requirements. The contributions in this thesis enabled the largest and longest dynamic rupture simulation to date of the 2004 Sumatra earthquake.
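The thesis's domain-specific language is not shown in this abstract; the general pattern it names, generating a specialized kernel from a high-level description with model parameters baked in at generation time, might be caricatured as follows (entirely hypothetical, not the thesis's DSL):

```python
def make_axpy(alpha):
    """Code-generation sketch: specialize a kernel at generation time by
    folding the model parameter alpha directly into the generated source."""
    src = (
        f"def kernel(x, y):\n"
        f"    return [{alpha} * xi + yi for xi, yi in zip(x, y)]\n"
    )
    namespace = {}
    exec(src, namespace)   # 'compile' the generated source into a callable
    return namespace["kernel"]
```

A real DSL of this kind would emit optimized C or GPU kernels for tensor operations rather than Python lists, but the separation is the same: the model is described once at a high level, and efficient code is generated from it.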