244 research outputs found

    Implicit-Explicit Time Integration for the Immersed Wave Equation

    Immersed boundary methods simplify mesh generation by embedding the domain of interest into an extended domain that is easy to mesh, introducing the challenge of dealing with cells that intersect the domain boundary. Combined with explicit time integration schemes, the finite cell method introduces a lower bound for the critical time step size. Explicit transient analyses commonly use the spectral element method due to its natural way of obtaining diagonal mass matrices through nodal lumping. Its combination with the finite cell method is called the spectral cell method. Unfortunately, a direct application of nodal lumping in the spectral cell method is impossible due to the special quadrature necessary to treat the discontinuous integrand inside the cut cells. We analyze an implicit-explicit (IMEX) time integration method to exploit the advantages of the nodal lumping scheme for uncut cells on one side and the unconditional stability of implicit time integration schemes for cut cells on the other. In this hybrid, immersed Newmark IMEX approach, we use explicit second-order central differences to integrate the uncut degrees of freedom that lead to a diagonal block in the mass matrix and an implicit trapezoidal Newmark method to integrate the remaining degrees of freedom (those supported by at least one cut cell). The immersed Newmark IMEX approach preserves the high-order convergence rates and the geometric flexibility of the finite cell method. We analyze a simple system of spring-coupled masses to highlight some of the essential characteristics of Newmark IMEX time integration. We then solve the scalar wave equation on two- and three-dimensional examples with significant geometric complexity to show that our approach is more efficient than state-of-the-art time integration schemes when comparing accuracy and runtime.
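    To make the partitioned update concrete, the following is a minimal sketch (a reading of the idea, not the authors' code) of a Newmark step with a per-degree-of-freedom beta parameter on two spring-coupled unit masses: beta = 0 recovers explicit central differences for the first mass, beta = 1/4 the implicit trapezoidal rule for the second, with gamma = 1/2 throughout. All matrices and values are illustrative.

        import numpy as np

        # Two spring-coupled unit masses: DOF 0 is integrated explicitly
        # (Newmark beta = 0, i.e. central differences), DOF 1 implicitly
        # (beta = 1/4, the trapezoidal rule); gamma = 1/2 for both.
        M = np.diag([1.0, 1.0])                    # lumped (diagonal) mass matrix
        K = np.array([[2.0, -1.0], [-1.0, 2.0]])   # spring stiffness matrix
        beta = np.array([0.0, 0.25])               # per-DOF Newmark beta
        gamma, h = 0.5, 0.1                        # Newmark gamma, time step

        u = np.array([1.0, 0.0])                   # initial displacements
        v = np.zeros(2)                            # initial velocities
        a = np.linalg.solve(M, -K @ u)             # consistent initial accelerations

        A = M + h**2 * (K @ np.diag(beta))         # effective system matrix

        for step in range(100):
            # Newmark predictors, evaluated with the DOF-wise beta
            u_star = u + h * v + h**2 * (0.5 - beta) * a
            v_star = v + h * (1.0 - gamma) * a
            # One small solve couples the two partitions consistently; a real
            # immersed solver would only solve the implicit (cut-cell) block.
            a = np.linalg.solve(A, -K @ u_star)
            u = u_star + h**2 * beta * a
            v = v_star + h * gamma * a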

    LIPIcs, Volume 274, ESA 2023, Complete Volume


    LIPIcs, Volume 277, GIScience 2023, Complete Volume


    Novel Degrees of Freedom, Constraints, and Stiffness Formulation for Physically Based Animation

    I identify and improve upon three distinct components of physically simulated systems with the aim of increasing both robustness and efficiency for computer graphics applications: A) the degrees of freedom of a system; B) the constraints put on that system; and C) the stiffness that derives from force differentiation and in turn enables implicit integration techniques. These three components come up in many implementations of physics-based simulation in computer animation. From a combination of these components, I explore four novel ideas implemented and experimented on over the course of my graduate degree. Eulerian-on-Lagrangian Cloth Simulation resolves a longstanding problem of simulating contact-mediated interaction of cloth and sharp geometric features by exploring a combination of all three of our components. Bilateral Staggered Projections for Joints explores the constrained degrees of freedom of articulated rigid bodies in a reduced state to extend the popular Staggered Projections technique into a novel formulation for rapid evaluation of frictional articulated dynamics. Condensation Jacobian with Adaptivity uses reduction methods to improve the efficiency of soft body deformations by allowing larger time steps in dynamics simulations. Finally, Ldot: Boosting Deformation Performance with Cholesky Extrapolation explores the inner workings of sparse direct solvers to introduce a Cholesky factorization that is linearly extrapolated in time, which can improve performance when encapsulated inside an iterative nonlinear solver.
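    As a concrete illustration of component (C), here is a minimal sketch (not the dissertation's code) of the classic linearized implicit Euler step for a damped mass-spring system, in the style of Baraff and Witkin: differentiating the forces yields the stiffness matrix, and the resulting linear system is what permits large, stable time steps. All values are illustrative.

        import numpy as np

        # Linearized backward Euler: (M - h*dF/dv - h^2*dF/dx) dv = h*(f + h*(dF/dx) v).
        # For linear springs, dF/dx = -K and dF/dv = -D, so the system matrix
        # becomes M + h*D + h^2*K -- the stiffness obtained by force differentiation.
        M = np.diag([1.0, 1.0])                    # mass matrix
        K = np.array([[2.0, -1.0], [-1.0, 2.0]])   # stiffness (negated force Jacobian)
        D = 0.1 * K                                # simple Rayleigh-style damping
        h = 0.05                                   # step that stays stable implicitly

        x = np.array([0.5, -0.2])                  # displacements
        v = np.zeros(2)                            # velocities
        A = M + h * D + h**2 * K                   # constant for this linear system

        for step in range(200):
            f = -K @ x - D @ v                     # elastic and damping forces
            dv = np.linalg.solve(A, h * (f - h * (K @ v)))
            v = v + dv
            x = x + h * v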

    LIPIcs, Volume 258, SoCG 2023, Complete Volume


    12th International Conference on Geographic Information Science: GIScience 2023, September 12–15, 2023, Leeds, UK


    CRYSTAL23


    Cache Optimization and Performance Modeling of Batched, Small, and Rectangular Matrix Multiplication on Intel, AMD, and Fujitsu Processors

    Factorization and multiplication of dense matrices and tensors are critical yet extremely expensive pieces of the scientific toolbox. Careful use of low-rank approximation can drastically reduce the computation and memory requirements of these operations. In addition to a lower arithmetic complexity, such methods can, by their structure, be designed to efficiently exploit modern hardware architectures. The majority of existing work relies on batched BLAS libraries to handle the computation of many small dense matrices. We show that through careful analysis of cache utilization, accumulation in SIMD registers, and a redesign of the implementation, one can achieve significantly higher throughput for these types of batched low-rank matrices across a large range of block and batch sizes. We test our algorithm on 3 CPUs with diverse ISAs: the Fujitsu A64FX using ARM SVE, the Intel Xeon 6148 using AVX-512, and the AMD EPYC 7502 using AVX2. Our new batching methodology obtains more than twice the throughput of vendor-optimized libraries for all CPU architectures and problem sizes.
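    Python cannot express the paper's SIMD-register and cache techniques directly, but the motivation for batching is easy to demonstrate: handing one batched call a whole stack of small matrices amortizes per-matrix overhead that a loop pays repeatedly. A rough, illustrative sketch (sizes arbitrary):

        import numpy as np
        import time

        # A batch of small rectangular GEMMs, as in block low-rank algorithms.
        batch, m, k, n = 10000, 16, 16, 16
        A = np.random.rand(batch, m, k)
        B = np.random.rand(batch, k, n)

        t0 = time.perf_counter()
        C_loop = np.stack([A[i] @ B[i] for i in range(batch)])  # one call per matrix
        t1 = time.perf_counter()
        C_batched = np.matmul(A, B)                             # one batched call
        t2 = time.perf_counter()

        assert np.allclose(C_loop, C_batched)
        print(f"loop: {t1 - t0:.4f}s  batched: {t2 - t1:.4f}s")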

    Task-based Runtime Optimizations Towards High Performance Computing Applications

    The last decades have witnessed a rapid improvement of computational capabilities in high-performance computing (HPC) platforms thanks to hardware technology scaling. HPC architectures benefit from mainstream hardware advances, with many-core systems, deep hierarchical memory subsystems, non-uniform memory access, and an ever-increasing gap between computational power and memory bandwidth. This has necessitated continuous adaptations across the software stack to maintain high hardware utilization. In this HPC landscape of potentially million-way parallelism, task-based programming models associated with dynamic runtime systems are becoming more popular; they foster developer productivity at extreme scale by abstracting the underlying hardware complexity. In this context, this dissertation highlights how a software bundle powered by a task-based programming model can address the heterogeneous workloads engendered by HPC applications, namely data redistribution, geostatistical modeling, and 3D unstructured mesh deformation. Data redistribution reshuffles data to optimize some objective for an algorithm; the objective can be multi-dimensional, such as improving computational load balance or decreasing communication volume or cost, with the ultimate goal of increasing efficiency and therefore reducing the time-to-solution. Geostatistical modeling, one of the prime motivating applications for exascale computing, is a technique for predicting desired quantities from geographically distributed data, based on statistical models and optimization of parameters. Meshing the deformable contour of moving 3D bodies is an expensive operation that poses huge computational challenges in fluid-structure interaction (FSI) applications. Therefore, this dissertation proposes Redistribute-PaRSEC, ExaGeoStat-PaRSEC, and HiCMA-PaRSEC to efficiently tackle these applications respectively at extreme scale; they are evaluated on multiple HPC clusters, including AMD-based, Intel-based, and Arm-based CPU systems and an IBM-based multi-GPU system. This multidisciplinary work emphasizes the need for runtime systems to go beyond their primary responsibility of task scheduling on massively parallel hardware when servicing next-generation scientific applications.
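    The core dataflow idea behind such task-based runtimes can be sketched in a few lines: tasks declare their data dependencies, and the runtime launches each task as soon as its inputs are ready. The toy below is illustrative only (PaRSEC adds distributed scheduling, data movement, and accelerator support) and chains futures to express the dependency graph:

        from concurrent.futures import ThreadPoolExecutor

        def submit(pool, fn, *deps):
            """Schedule fn to run once all of its dependency futures complete."""
            def task():
                args = [d.result() for d in deps]  # block until inputs exist
                return fn(*args)
            return pool.submit(task)

        with ThreadPoolExecutor(max_workers=4) as pool:
            a = submit(pool, lambda: 2)                  # independent source tasks
            b = submit(pool, lambda: 3)
            c = submit(pool, lambda x, y: x + y, a, b)   # runs only after a and b
            d = submit(pool, lambda x: x * 10, c)        # runs only after c
            print(d.result())                            # -> 50
        # A real runtime tracks readiness instead of blocking worker threads.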

    A domain-extensible compiler with controllable automation of optimisations

    In high performance domains like image processing, physics simulation or machine learning, program performance is critical. Programmers called performance engineers are responsible for the challenging task of optimising programs. Two major challenges prevent modern compilers targeting heterogeneous architectures from reliably automating optimisation. First, domain-specific compilers such as Halide for image processing and TVM for machine learning are difficult to extend with the new optimisations required by new algorithms and hardware. Second, automatic optimisation often cannot achieve the required performance, and performance engineers fall back to painstaking manual optimisation. This thesis shows the potential of the Shine compiler to achieve domain-extensibility and controllable automation, and to generate high-performance code. Domain-extensibility facilitates adapting compilers to new algorithms and hardware. Controllable automation enables performance engineers to gradually take control of the optimisation process. The first research contribution is to add 3 code generation features to Shine, namely synchronisation barrier insertion, kernel execution, and storage folding. Adding these features requires making novel design choices in terms of compiler extensibility and controllability. The rest of this thesis builds on these features to generate code with competitive runtime compared to established domain-specific compilers. The second research contribution is to demonstrate how extensibility and controllability are exploited to optimise a standard image processing pipeline for corner detection. Shine applies 6 well-known image processing optimisations, 2 of which are not supported by Halide. Our results on 4 ARM multi-core CPUs show that the code generated by Shine for corner detection runs up to 1.4× faster than the Halide code. However, we observe that controlling rewriting is tedious, motivating the need for more automation. The final research contribution is to introduce sketch-guided equality saturation, a semi-automated technique that allows performance engineers to guide program rewriting by specifying rewrite goals as sketches: program patterns that leave details unspecified. We evaluate this approach by applying 7 realistic optimisations of matrix multiplication. Without guidance, the compiler fails to apply the 5 most complex optimisations even given an hour and 60GB of RAM. With the guidance of at most 3 sketch guides, each 10 times smaller than the complete program, the compiler applies the optimisations in seconds using less than 1GB.
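    To give a flavour of sketch-guided rewriting (a toy, not Shine's e-graph-based implementation): a sketch is a program pattern with holes, and the search applies rewrite rules until the program matches it. Here expressions are nested tuples, the rules are commutativity and distributivity, and '?' marks a hole:

        from collections import deque

        # Expressions are ("+", a, b), ("*", a, b), or variable names.
        def rewrites(e):
            """Yield every expression reachable by one rule application."""
            if isinstance(e, tuple):
                op, a, b = e
                yield (op, b, a)                                 # commutativity
                if op == "*" and isinstance(b, tuple) and b[0] == "+":
                    yield ("+", ("*", a, b[1]), ("*", a, b[2]))  # distributivity
                for i, sub in ((1, a), (2, b)):                  # rewrite subterms
                    for s in rewrites(sub):
                        yield e[:i] + (s,) + e[i + 1:]

        def matches(e, sketch):
            """A '?' in the sketch matches any subexpression."""
            if sketch == "?":
                return True
            if isinstance(sketch, tuple) and isinstance(e, tuple):
                return all(matches(x, s) for x, s in zip(e, sketch))
            return e == sketch

        def rewrite_towards(expr, sketch, limit=10000):
            """Breadth-first search for any form that matches the sketch."""
            seen, queue = {expr}, deque([expr])
            while queue and len(seen) < limit:
                e = queue.popleft()
                if matches(e, sketch):
                    return e
                for nxt in rewrites(e):
                    if nxt not in seen:
                        seen.add(nxt)
                        queue.append(nxt)
            return None

        # Guide x * (y + z) towards a sum-of-products shape.
        print(rewrite_towards(("*", "x", ("+", "y", "z")), ("+", "?", "?")))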