3,172 research outputs found

    An Optimized, Easy-to-use, Open-source GPU Solver for Large-scale Inverse Homogenization Problems

    We propose a high-performance GPU solver for inverse homogenization problems to design high-resolution 3D microstructures. Central to our solver is a favorable combination of data structures and algorithms, making full use of the parallel computation power of today's GPUs through a software-level design space exploration. This solver is demonstrated to optimize homogenized stiffness tensors, such as bulk modulus, shear modulus, and Poisson's ratio, under the constraint of bounded material volume. Practical high-resolution examples with 512^3 (134.2 million) finite elements run in less than 32 seconds per iteration with a peak memory of 21 GB. Besides, our GPU implementation is equipped with an easy-to-use framework with less than 20 lines of code to support various objective functions defined by the homogenized stiffness tensors. Our open-source high-performance implementation is publicly accessible at https://github.com/lavenklau/homo3d

    A new smoothed particle hydrodynamics method based on high-order moving-least-square targeted essentially non-oscillatory scheme for compressible flows

    In this study, we establish a hybrid high-order smoothed particle hydrodynamics (SPH) framework (MLS-TENO-SPH) for compressible flows with discontinuities, which is able to achieve genuine high-order convergence in smooth regions and also capture discontinuities well in non-smooth regions. The framework can be fully Lagrangian, Eulerian, or arbitrary-Lagrangian-Eulerian (ALE), with the ALE variant enforcing an isotropic particle distribution in specific cases. In the proposed framework, the computational domain is divided into smooth regions and non-smooth regions, and these two regions are determined by a strong scale separation strategy in the targeted essentially non-oscillatory (TENO) scheme. In smooth regions, the moving-least-square (MLS) approximation is used to evaluate high-order derivative operators, which realizes a genuine high-order construction; in non-smooth regions, the new TENO scheme based on Vila's framework with several new improvements is deployed to capture discontinuities and high-wavenumber flow scales with low numerical dissipation. The present MLS-TENO-SPH method is validated with a set of challenging cases based on the Eulerian, Lagrangian or ALE framework. Numerical results demonstrate that the MLS-TENO-SPH method features lower numerical dissipation and higher efficiency than the conventional method, and can restore genuine high-order accuracy in smooth regions. Overall, the proposed framework serves as a new exploration in high-order SPH methods, which show potential for compressible flow simulations with shockwaves. Comment: 36 pages, 15 figures, accepted by Journal of Computational Physics on June 1st, 202

    Guided rewriting and constraint satisfaction for parallel GPU code generation

    Graphics Processing Units (GPUs) are notoriously hard to optimise for manually due to their scheduling and memory hierarchies. What is needed are good automatic code generators and optimisers for such parallel hardware. Functional approaches such as Accelerate, Futhark and LIFT leverage a high-level algorithmic Intermediate Representation (IR) to expose parallelism and abstract the implementation details away from the user. However, producing efficient code for a given accelerator remains challenging. Existing code generators depend on the user input to choose a subset of hard-coded optimizations or automated exploration of implementation search space. The former suffers from the lack of extensibility, while the latter is too costly due to the size of the search space. A hybrid approach is needed, where a space of valid implementations is built automatically and explored with the aid of human expertise. This thesis presents a solution combining user-guided rewriting and automatically generated constraints to produce high-performance code. The first contribution is an automatic tuning technique to find a balance between performance and memory consumption. Leveraging its functional patterns, the LIFT compiler is empowered to infer tuning constraints and limit the search to valid tuning combinations only. Next, the thesis reframes parallelisation as a constraint satisfaction problem. Parallelisation constraints are extracted automatically from the input expression, and a solver is used to identify valid rewriting. The constraints truncate the search space to valid parallel mappings only by capturing the scheduling restrictions of the GPU in the context of a given program. A synchronisation barrier insertion technique is proposed to prevent data races and improve the efficiency of the generated parallel mappings. 
The final contribution of this thesis is the guided rewriting method, where the user encodes a design space of structural transformations using high-level IR nodes called rewrite points. These strongly typed pragmas express macro rewrites and expose design choices as explorable parameters. The thesis proposes a small set of reusable rewrite points to achieve tiling, cache locality, data reuse and memory optimisation. A comparison with the vendor-provided handwritten kernel ARM Compute Library and the TVM code generator demonstrates the effectiveness of this thesis' contributions. With convolution as a use case, LIFT-generated direct and GEMM-based convolution implementations are shown to perform on par with the state-of-the-art solutions on a mobile GPU. Overall, this thesis demonstrates that a functional IR yields well to user-guided and automatic rewriting for high-performance code generation

    Automatic Loop Nest Parallelization for the Predictable Execution Model

    Currently, embedded real-time systems still widely use single-core processors. A major challenge in the adoption of multicore processors is the presence of shared hardware resources such as main memory. Contention between threads executing on different cores for access to such resources makes it difficult to tightly estimate the Worst-Case Execution Time (WCET) of applications. To safely employ multicore processors in real-time systems, previous work has introduced a PRedictable Execution Model (PREM) for embedded Multi-Processor Systems-on-a-Chip (MPSoCs). Under PREM, each thread is divided into memory phases, where the code and data required by the thread are moved from main memory to a local memory (cache or scratchpad) or vice versa, and execution phases, where the thread computes based on the code and data available in local memory. Memory phases are then scheduled by the Operating System (OS) to avoid contention among threads, thus resulting in tight WCET bounds. The main challenge in applying the model is to automatically generate optimized PREM-compliant code instead of rewriting programs manually. Note that many programs of interest, such as emerging AI and neural network kernels, comprise both compute-intensive and memory-intensive deeply nested loops. Hence, PREM code generation and optimization should be applicable to nested loop structures and consider whether performance is constrained by computation or memory transfers. In this thesis, we address the problem of automatically parallelizing and optimizing nested loop structure programs by presenting a workflow that automatically generates PREM-compliant optimized code. To correctly model the structure of nested loop programs, we leverage existing polyhedral compilation tools that analyze the original program and generate optimized executables. Two main techniques are adopted for optimization: loop tiling and parallelization. 
We build a timing model to estimate the length of execution and memory phases, and then construct a Directed Acyclic Graph (DAG) of program phases to estimate its makespan. During this process, our framework searches for the combination of tile sizes and thread numbers that minimize the makespan of the program; given the complexity of the optimization problem, we design a heuristic algorithm to find solutions close to the optimal. Finally, to show its usefulness, we evaluate our technique based on the Gem5 architectural simulator on computational kernels from the PolyBench-NN benchmark

    Towards Exascale Computation for Turbomachinery Flows

    A state-of-the-art large eddy simulation code has been developed to solve compressible flows in turbomachinery. The code has been engineered with a high degree of scalability, enabling it to effectively leverage the many-core architecture of the new Sunway system. A consistent performance of 115.8 DP-PFLOPs has been achieved on a high-pressure turbine cascade consisting of over 1.69 billion mesh elements and 865 billion Degrees of Freedom (DOFs). By leveraging a high-order unstructured solver and its portability to large heterogeneous parallel systems, we have progressed towards solving the grand challenge problem outlined by NASA, which involves a time-dependent simulation of a complete engine, incorporating all the aerodynamic and heat transfer components. Comment: SC23, November, 2023, Denver, CO., US

    Lifting Code Generation of Cardiac Physiology Simulation to Novel Compiler Technology

    The study of numerical models for the human body has become a major focus of the research community in biology and medicine. For instance, numerical ionic models of a complex organ, such as the heart, must be able to represent individual cells and their interconnections through ionic channels, forming a system with billions of cells, and requiring efficient code to handle such a large system. The modeling of the electrical system of the heart combines a compute-intensive kernel that calculates the intensity of current flowing through cell membranes, and feeds a linear solver for computing the electrical potential of each cell. Considering this context, we propose limpetMLIR, a code generator and compiler transformer to accelerate the kernel phase of ionic models and bridge the gap between compiler technology and electrophysiology simulation. LimpetMLIR makes use of the MLIR infrastructure, its dialects, and transformations to drive forward the study of ionic models, and accelerate the execution of multi-cell systems. Experiments conducted on 43 ionic models show that our limpetMLIR-based code generation greatly outperforms current state-of-the-art simulation systems by an average of 2.9×, reaching peak speedups of more than 15× in some cases. To our knowledge, this is the first work that deeply connects an optimizing compiler infrastructure to electrophysiology models of the human body, showing the potential benefits of using compiler technology in the simulation of human cell interactions

    Turbulence closure with small, local neural networks: Forced two-dimensional and β-plane flows

    We parameterize sub-grid scale (SGS) fluxes in sinusoidally forced two-dimensional turbulence on the β-plane at high Reynolds numbers (Re ∼ 25000) using simple 2-layer Convolutional Neural Networks (CNN) having only O(1000) parameters, two orders of magnitude smaller than recent studies employing deeper CNNs with 8-10 layers; we obtain stable, accurate, and long-term online or a posteriori solutions at 16X downscaling factors. Our methodology significantly improves training efficiency and speed of online Large Eddy Simulations (LES) runs, while offering insights into the physics of closure in such turbulent flows. Our approach benefits from extensive hyperparameter searching in learning rate and weight decay coefficient space, as well as the use of cyclical learning rate annealing, which leads to more robust and accurate online solutions compared to fixed learning rates. Our CNNs use either the coarse velocity or the vorticity and strain fields as inputs, and output the two components of the deviatoric stress tensor. We minimize a loss between the SGS vorticity flux divergence (computed from the high-resolution solver) and that obtained from the CNN-modeled deviatoric stress tensor, without requiring energy or enstrophy preserving constraints. The success of shallow CNNs in accurately parameterizing this class of turbulent flows implies that the SGS stresses have a weak non-local dependence on coarse fields; it also aligns with our physical conception that small-scales are locally controlled by larger scales such as vortices and their strained filaments. Furthermore, 2-layer CNN-parameterizations are more likely to be interpretable and generalizable because of their intrinsic low dimensionality. Comment: 27 pages, 13 figure
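
    The abstract above credits cyclical learning rate annealing for more robust training but does not say which schedule was used. As an illustration only, the following sketch shows one common form, a triangular cycle that ramps linearly between a lower and an upper learning rate. The function name, the parameter names, and the numeric values are assumptions made for this example and are not taken from the paper.

```python
def triangular_cyclical_lr(step, base_lr, max_lr, half_cycle_steps):
    """Learning rate at a given step for a triangular cyclical schedule.

    Hypothetical helper: the rate climbs linearly from base_lr to max_lr
    over half_cycle_steps steps, then descends back over the same number
    of steps, and the pattern repeats.
    """
    cycle_length = 2 * half_cycle_steps
    position = step % cycle_length
    if position <= half_cycle_steps:
        fraction = position / half_cycle_steps                    # rising half
    else:
        fraction = (cycle_length - position) / half_cycle_steps   # falling half
    return base_lr + (max_lr - base_lr) * fraction


# Illustrative values only (not from the paper).
SCHEDULE = [triangular_cyclical_lr(s, 1e-3, 1e-2, 4) for s in range(9)]
```

    In a training loop, a value like this would be assigned to the optimizer's learning rate at every step; a fixed learning rate corresponds to setting `base_lr` equal to `max_lr`.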

    Current issues of the management of socio-economic systems in terms of globalization challenges

    The authors of the scientific monograph conclude that managing socio-economic systems in the face of global challenges requires mechanisms that ensure security, optimise the use of resource potential, increase competitiveness, and provide state support to economic entities. Basic research focuses on the assessment of economic entities in terms of global challenges and on the analysis of the financial system, migration flows, logistics and product exports, and territorial development. The research results have been implemented in different decision-making models in the context of global challenges, strategic planning, financial and food security, education management, information technology and innovation. The results of the study can be used in developing directions, programmes and strategies for the sustainable development of economic entities and regions, in increasing the competitiveness of products and services, and in decision-making at the level of ministries and agencies that regulate the management of socio-economic systems. The results can also be used by students and young scientists in the educational process and in conducting scientific research on the management of socio-economic systems in terms of global challenges

    Ordonnancement sous contrainte mémoire en domptant la localité des données dans un modèle de programmation à base de tâches [Memory-constrained scheduling by taming data locality in a task-based programming model]

    A now-classical way of meeting the increasing demand for computing speed by HPC applications is the use of GPUs and/or other accelerators. Such accelerators have their own memory, which is usually quite limited, and are connected to the main memory through a bus with bounded bandwidth. Thus, particular care should be devoted to data locality in order to avoid unnecessary data movements. Task-based runtime schedulers have emerged as a convenient and efficient way to use such heterogeneous platforms. When processing an application, the scheduler has the knowledge of all tasks available for processing on a GPU, as well as their input data dependencies. Hence, it is possible to produce a task processing order aiming at reducing the total processing time through three objectives: minimizing data transfers, overlapping transfers and computation, and optimizing the eviction of previously-loaded data. In this paper, we focus on how to schedule tasks that share some of their input data (but are otherwise independent) on a single GPU. We provide a formal model of the problem, exhibit an optimal eviction strategy, and show that ordering tasks to minimize data movement is NP-complete. We review and adapt existing ordering strategies to this problem, and propose a new one based on task aggregation. We prove that the underlying problem of this new strategy is NP-complete, and prove the reasonable complexity of our proposed heuristic. These strategies have been implemented in the StarPU runtime system. We present their performance on tasks from tiled 2D, 3D matrix products, Cholesky factorization, randomized task order, randomized data pairs from the 2D matrix product as well as a sparse matrix product. We introduce a visual way to understand these performance results and lower bounds on the number of data loads for the 2D and 3D matrix products. 
Our experiments demonstrate that using our new strategy together with the optimal eviction policy reduces the amount of data movement as well as the total processing time

    Algebraic Temporal Blocking for Sparse Iterative Solvers on Multi-Core CPUs

    Sparse linear iterative solvers are essential for many large-scale simulations. Much of the runtime of these solvers is often spent in the implicit evaluation of matrix polynomials via a sequence of sparse matrix-vector products. A variety of approaches has been proposed to make these polynomial evaluations explicit (i.e., fix the coefficients), e.g., polynomial preconditioners or s-step Krylov methods. Furthermore, it is nowadays a popular practice to approximate triangular solves by a matrix polynomial to increase parallelism. Such algorithms allow the polynomial to be evaluated using a so-called matrix power kernel (MPK), which computes the product between a power of a sparse matrix A and a dense vector x, or a related operation. Recently we have shown that using the level-based formulation of sparse matrix-vector multiplications in the Recursive Algebraic Coloring Engine (RACE) framework we can perform temporal cache blocking of MPK to increase its performance. In this work, we demonstrate the application of this cache-blocking optimization in sparse iterative solvers. By integrating the RACE library into the Trilinos framework, we demonstrate the speedups achieved in (preconditioned) s-step GMRES, polynomial preconditioners, and algebraic multigrid (AMG). For MPK-dominated algorithms we achieve speedups of up to 3x on modern multi-core compute nodes. For algorithms with moderate contributions from subspace orthogonalization, the gain reduces significantly, which is often caused by the insufficient quality of the orthogonalization routines. Finally, we showcase the application of RACE-accelerated solvers in a real-world wind turbine simulation (Nalu-Wind) and highlight the new opportunities and perspectives opened up by RACE as a cache-blocking technique for MPK-enabled sparse solvers. Comment: 25 pages, 11 figures, 3 table
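
    To make the matrix power kernel described above concrete, here is a minimal sketch of the naive baseline: computing x, Ax, A^2 x, ..., A^p x as p back-to-back sparse matrix-vector products on a matrix in CSR storage. This is not the RACE implementation and performs no temporal cache blocking; the function names, the CSR argument layout, and the small test matrix are assumptions chosen for illustration.

```python
def csr_matvec(indptr, indices, data, x):
    """Return y = A @ x for a sparse matrix A stored in CSR form."""
    n_rows = len(indptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        acc = 0.0
        # Nonzeros of this row live in data[indptr[row]:indptr[row + 1]].
        for k in range(indptr[row], indptr[row + 1]):
            acc += data[k] * x[indices[k]]
        y[row] = acc
    return y


def matrix_power_kernel(indptr, indices, data, x, p):
    """Return [x, A x, A^2 x, ..., A^p x] using p successive SpMVs.

    Naive baseline: each power streams the whole matrix again, which is
    the repeated memory traffic that cache blocking aims to reduce.
    """
    vectors = [list(x)]
    for _ in range(p):
        vectors.append(csr_matvec(indptr, indices, data, vectors[-1]))
    return vectors


# Illustrative 3x3 tridiagonal matrix [[2, -1, 0], [-1, 2, -1], [0, -1, 2]].
INDPTR = [0, 2, 5, 7]
INDICES = [0, 1, 0, 1, 2, 1, 2]
DATA = [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0]
POWERS = matrix_power_kernel(INDPTR, INDICES, DATA, [1.0, 1.0, 1.0], 2)
```

    Each call to `csr_matvec` reads the full matrix, so p powers cost roughly p passes over the matrix data; a cache-blocked variant would instead reuse matrix rows across several powers while they are still in cache.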