An Optimized, Easy-to-use, Open-source GPU Solver for Large-scale Inverse Homogenization Problems
We propose a high-performance GPU solver for inverse homogenization problems
to design high-resolution 3D microstructures. Central to our solver is a
favorable combination of data structures and algorithms, making full use of the
parallel computation power of today's GPUs through a software-level design
space exploration. This solver is demonstrated to optimize homogenized
stiffness tensors, such as bulk modulus, shear modulus, and Poisson's ratio,
under the constraint of bounded material volume. Practical high-resolution
examples with 512^3 (134.2 million) finite elements run in less than 32 seconds
per iteration with a peak memory of 21 GB. In addition, our GPU implementation
provides an easy-to-use framework in which less than 20 lines of code suffice to
support various objective functions defined by the homogenized stiffness
tensors. Our open-source high-performance implementation is publicly accessible
at https://github.com/lavenklau/homo3d.
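As an illustration of how short such user-defined objectives can be, here is a hypothetical sketch (names and conventions are our assumptions, not the solver's actual API) of objective functions defined on a homogenized stiffness tensor C in 6x6 Voigt notation:

```python
# Hypothetical sketch: objectives defined on a homogenized stiffness
# tensor C (6x6 Voigt notation). Illustrative only; not the homo3d API.

def bulk_modulus(C):
    # Voigt estimate: K = (1/9) * sum of the upper-left 3x3 block of C
    return sum(C[i][j] for i in range(3) for j in range(3)) / 9.0

def shear_modulus(C):
    # Voigt-average shear estimate from the diagonal shear entries
    return (C[3][3] + C[4][4] + C[5][5]) / 3.0

# sanity check on an isotropic tensor with lambda = mu = 1,
# for which K = lambda + 2*mu/3 and G = mu
lam, mu = 1.0, 1.0
C = [[0.0] * 6 for _ in range(6)]
for i in range(3):
    for j in range(3):
        C[i][j] = lam
    C[i][i] += 2 * mu
    C[i + 3][i + 3] = mu
```

An optimizer would then maximize such a scalar under the volume constraint.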
A new smoothed particle hydrodynamics method based on high-order moving-least-square targeted essentially non-oscillatory scheme for compressible flows
In this study, we establish a hybrid high-order smoothed particle
hydrodynamics (SPH) framework (MLS-TENO-SPH) for compressible flows with
discontinuities, which is able to achieve genuine high-order convergence in
smooth regions and also capture discontinuities well in non-smooth regions. The
framework can be fully Lagrangian, fully Eulerian, or
arbitrary-Lagrangian-Eulerian (ALE), enforcing an isotropic particle
distribution in specific cases. In the proposed framework, the computational
domain is divided into smooth regions and non-smooth regions, and these two
regions are determined by a strong scale separation strategy in the targeted
essentially non-oscillatory (TENO) scheme. In smooth regions, the
moving-least-square (MLS) approximation is used to evaluate high-order
derivative operators, which realizes genuinely high-order construction;
in non-smooth regions, the new TENO scheme based on Vila's framework with
several new improvements will be deployed to capture discontinuities and
high-wavenumber flow scales with low numerical dissipation. The present
MLS-TENO-SPH method is validated with a set of challenging cases based on the
Eulerian, Lagrangian or ALE framework. Numerical results demonstrate that the
MLS-TENO-SPH method features lower numerical dissipation and higher efficiency
than the conventional method, and can restore genuine high-order accuracy in
smooth regions. Overall, the proposed framework serves as a new exploration in
high-order SPH methods, which hold potential for compressible flow simulations
with shockwaves.
Comment: 36 pages, 15 figures, accepted by Journal of Computational Physics on June 1st, 202
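To make the MLS ingredient concrete, here is a minimal one-dimensional sketch (our own illustration, not the paper's implementation): a weighted linear least-squares fit around a point yields a derivative estimate from scattered neighbours, with a Gaussian weight standing in for an SPH smoothing kernel.

```python
# Illustrative 1D moving-least-square (MLS) derivative estimate from
# scattered neighbours: fit a + b*(x - x0) by weighted least squares;
# b approximates f'(x0). Gaussian weight plays the SPH-kernel role.
import math

def mls_derivative(x0, xs, fs, h):
    # weighted normal equations for the linear fit a + b*(x - x0)
    m00 = m01 = m11 = r0 = r1 = 0.0
    for x, f in zip(xs, fs):
        d = x - x0
        w = math.exp(-(d / h) ** 2)   # Gaussian weight of radius ~h
        m00 += w; m01 += w * d; m11 += w * d * d
        r0 += w * f; r1 += w * f * d
    det = m00 * m11 - m01 * m01
    return (m00 * r1 - m01 * r0) / det   # Cramer's rule: b ≈ f'(x0)

xs = [0.1 * i for i in range((-5), 6)]
fs = [math.sin(x) for x in xs]
slope = mls_derivative(0.0, xs, fs, h=0.3)   # ≈ cos(0) = 1
```

Higher-order convergence, as in the paper, comes from fitting higher-degree polynomial bases in the same way.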
Guided rewriting and constraint satisfaction for parallel GPU code generation
Graphics Processing Units (GPUs) are notoriously hard to optimise for manually due to their scheduling and memory hierarchies. What is needed are good automatic code generators and optimisers for such parallel hardware. Functional approaches such as Accelerate, Futhark and LIFT leverage a high-level algorithmic Intermediate Representation (IR) to expose parallelism and abstract the implementation details away from the user. However, producing efficient code for a given accelerator remains challenging. Existing code generators depend either on user input to choose a subset of hard-coded optimisations or on automated exploration of the implementation search space. The former suffers from a lack of extensibility, while the latter is too costly due to the size of the search space. A hybrid approach is needed, where a space of valid implementations is built automatically and explored with the aid of human expertise.
This thesis presents a solution combining user-guided rewriting and automatically generated constraints to produce high-performance code. The first contribution is an automatic tuning technique to find a balance between performance and memory consumption. Leveraging its functional patterns, the LIFT compiler is empowered to infer tuning constraints and limit the search to valid tuning combinations only.
Next, the thesis reframes parallelisation as a constraint satisfaction problem. Parallelisation constraints are extracted automatically from the input expression, and a solver is used to identify valid rewritings. The constraints truncate the search space to valid parallel mappings only by capturing the scheduling restrictions of the GPU in the context of a given program. A synchronisation barrier insertion technique is proposed to prevent data races and improve the efficiency of the generated parallel mappings.
The final contribution of this thesis is the guided rewriting method, where the user encodes a design space of structural transformations using high-level IR nodes called rewrite points. These strongly typed pragmas express macro rewrites and expose design choices as explorable parameters. The thesis proposes a small set of reusable rewrite points to achieve tiling, cache locality, data reuse and memory optimisation.
A comparison with the vendor-provided handwritten kernels of the ARM Compute Library and with the TVM code generator demonstrates the effectiveness of this thesis' contributions. With convolution as a use case, LIFT-generated direct and GEMM-based convolution implementations are shown to perform on par with the state-of-the-art solutions on a mobile GPU. Overall, this thesis demonstrates that a functional IR lends itself well to user-guided and automatic rewriting for high-performance code generation.
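The flavour of inferred tuning constraints can be sketched as follows (a hypothetical toy, not LIFT's actual constraint system; the names and the 48 KB local-memory budget are our assumptions): rather than exploring every tile-size pair, only combinations satisfying divisibility and memory constraints are kept.

```python
# Toy sketch of constraint-pruned tuning: keep only (tileX, tileY)
# pairs that divide the iteration space and fit a local-memory budget.
# Names and the 48 KB budget are illustrative assumptions.
def valid_tilings(n, m, local_mem_bytes=48 * 1024, elem_bytes=4):
    candidates = [2 ** k for k in range(1, 8)]  # tile sizes 2..128
    valid = []
    for tx in candidates:
        for ty in candidates:
            if n % tx or m % ty:
                continue  # constraint: tiles must divide the iteration space
            if tx * ty * elem_bytes > local_mem_bytes:
                continue  # constraint: tile must fit in local memory
            valid.append((tx, ty))
    return valid

tilings = valid_tilings(1024, 1024)
```

The search then runs only over `tilings` instead of all 49 raw combinations.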
Automatic Loop Nest Parallelization for the Predictable Execution Model
Currently, embedded real-time systems still widely use single-core processors. A major challenge in the adoption of multicore processors is the presence of shared hardware resources such as main memory. Contention between threads executing on different cores for access to such resources makes it difficult to tightly estimate the Worst-Case Execution Time (WCET) of applications. To safely employ multicore processors in real-time systems, previous work has introduced a PRedictable Execution Model (PREM) for embedded Multi-Processor Systems-on-a-Chip (MPSoCs). Under PREM, each thread is divided into memory phases, where the code and data required by the thread are moved from main memory to a local memory (cache or scratchpad) or vice versa, and execution phases, where the thread computes based on the code and data available in local memory. Memory phases are then scheduled by the Operating System (OS) to avoid contention among threads, thus resulting in tight WCET bounds. The main challenge in applying the model is to automatically generate optimized PREM-compliant code instead of rewriting programs manually. Note that many programs of interest, such as emerging AI and neural network kernels, comprise both compute-intensive and memory-intensive deeply nested loops. Hence, PREM code generation and optimization should be applicable to nested loop structures and consider whether performance is constrained by computation or memory transfers.
In this thesis, we address the problem of automatically parallelizing and optimizing nested loop structure programs by presenting a workflow that automatically generates PREM-compliant optimized code. To correctly model the structure of nested loop programs, we leverage existing polyhedral compilation tools that analyze the original program and generate optimized executables. Two main techniques are adopted for optimization: loop tiling and parallelization. We build a timing model to estimate the length of execution and memory phases, and then construct a Directed Acyclic Graph (DAG) of program phases to estimate its makespan. During this process, our framework searches for the combination of tile sizes and thread numbers that minimizes the makespan of the program; given the complexity of the optimization problem, we design a heuristic algorithm to find solutions close to optimal. Finally, to show its usefulness, we evaluate our technique based on the Gem5 architectural simulator on computational kernels from the PolyBench-NN benchmark.
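A drastically simplified timing-model sketch conveys the shape of such a search (the formulas and constants below are our assumptions, not the thesis' model): memory phases are serialized by the contention-free schedule, execution phases run in parallel across threads, and tile size trades per-tile setup overhead against parallel rounds.

```python
# Assumed toy timing model for a PREM-style tiled loop: serialized
# memory phases plus parallel execution rounds. Constants are made up.
def makespan(n_iters, tile, threads,
             t_compute=1.0, t_mem_per_elem=0.5, t_setup=10.0):
    n_tiles = -(-n_iters // tile)                 # ceil division
    mem_phase = tile * t_mem_per_elem + t_setup   # per-tile DMA/load time
    rounds = -(-n_tiles // threads)               # parallel execution rounds
    # memory phases serialized (contention-free), execution in parallel
    return n_tiles * mem_phase + rounds * tile * t_compute

best = min(((makespan(4096, t, c), t, c)
            for t in (64, 128, 256, 512)
            for c in (1, 2, 4)), key=lambda s: s[0])
```

Under these numbers the search favours large tiles (amortizing per-tile setup) and the maximum thread count; a real framework explores this space with a heuristic rather than exhaustively.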
Towards Exascale Computation for Turbomachinery Flows
A state-of-the-art large eddy simulation code has been developed to solve
compressible flows in turbomachinery. The code has been engineered with a high
degree of scalability, enabling it to effectively leverage the many-core
architecture of the new Sunway system. A consistent performance of 115.8
DP-PFLOPs has been achieved on a high-pressure turbine cascade consisting of
over 1.69 billion mesh elements and 865 billion degrees of freedom (DOFs). By
leveraging a high-order unstructured solver and its portability to large
heterogeneous parallel systems, we have progressed towards solving the grand
challenge problem outlined by NASA, which involves a time-dependent simulation
of a complete engine, incorporating all the aerodynamic and heat transfer
components.
Comment: SC23, November 2023, Denver, CO, US
Lifting Code Generation of Cardiac Physiology Simulation to Novel Compiler Technology
The study of numerical models for the human body has become a major focus of the research community in biology and medicine. For instance, numerical ionic models of a complex organ, such as the heart, must be able to represent individual cells and their interconnections through ionic channels, forming a system with billions of cells, and requiring efficient code to handle such a large system. The modeling of the electrical system of the heart combines a compute-intensive kernel that calculates the intensity of current flowing through cell membranes, and feeds a linear solver for computing the electrical potential of each cell. Considering this context, we propose limpetMLIR, a code generator and compiler transformer to accelerate the kernel phase of ionic models and bridge the gap between compiler technology and electrophysiology simulation. LimpetMLIR makes use of the MLIR infrastructure, its dialects, and transformations to drive forward the study of ionic models, and accelerate the execution of multi-cell systems. Experiments conducted on 43 ionic models show that our limpetMLIR-based code generation greatly outperforms current state-of-the-art simulation systems by an average of 2.9×, reaching peak speedups of more than 15× in some cases. To our knowledge, this is the first work that deeply connects an optimizing compiler infrastructure to electrophysiology models of the human body, showing the potential benefits of using compiler technology in the simulation of human cell interactions.
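The structure of the accelerated "kernel phase" can be sketched as follows (a toy leak-current model of our own, not a real ionic model or the limpetMLIR code): each cell's membrane current is an independent function of its state, which is exactly the embarrassingly parallel shape an MLIR-based generator can vectorize.

```python
# Schematic sketch of an ionic-model kernel phase: one independent
# membrane-current update per cell, feeding the linear solver for the
# potentials. Toy leak current; constants are illustrative only.
def kernel_phase(voltages, g_leak=0.3, e_leak=-54.4):
    # one independent update per cell -> trivially parallelizable,
    # which is what compiler-generated vectorized code exploits
    return [g_leak * (v - e_leak) for v in voltages]

currents = kernel_phase([-80.0, -54.4, 0.0])
```

Real ionic models replace the one-line current law with dozens of gating-variable equations per cell, which is where generated code pays off.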
Turbulence closure with small, local neural networks: Forced two-dimensional and β-plane flows
We parameterize sub-grid scale (SGS) fluxes in sinusoidally forced
two-dimensional turbulence on the β-plane at high Reynolds numbers
(Re ≳ 25000) using simple 2-layer Convolutional Neural Networks (CNNs) having
only O(1000) parameters, two orders of magnitude smaller than recent studies
employing deeper CNNs with 8-10 layers; we obtain stable, accurate, and
long-term online or a posteriori solutions at 16× downscaling factors. Our
methodology significantly improves training efficiency and speed of online
Large Eddy Simulations (LES) runs, while offering insights into the physics of
closure in such turbulent flows. Our approach benefits from extensive
hyperparameter searching in learning rate and weight decay coefficient space,
as well as the use of cyclical learning rate annealing, which leads to more
robust and accurate online solutions compared to fixed learning rates. Our CNNs
use either the coarse velocity or the vorticity and strain fields as inputs,
and output the two components of the deviatoric stress tensor. We minimize a
loss between the SGS vorticity flux divergence (computed from the
high-resolution solver) and that obtained from the CNN-modeled deviatoric
stress tensor, without requiring energy or enstrophy preserving constraints.
The success of shallow CNNs in accurately parameterizing this class of
turbulent flows implies that the SGS stresses have a weak non-local dependence
on coarse fields; it also aligns with our physical conception that small scales
are locally controlled by larger scales such as vortices and their strained
filaments. Furthermore, 2-layer CNN parameterizations are more likely to be
interpretable and generalizable because of their intrinsic low dimensionality.
Comment: 27 pages, 13 figures
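A quick back-of-envelope check (our arithmetic on an assumed layer layout, not the paper's exact architecture) shows how a 2-layer CNN stays at O(1000) parameters: with 2 input fields, f filters, 2 output stress components, and k×k kernels, the count is just two convolution weight tensors plus biases.

```python
# Parameter count of an assumed 2-layer CNN closure: conv1 maps
# in_ch input fields to f filters, conv2 maps them to out_ch outputs,
# each with k x k kernels plus per-filter biases.
def cnn_params(in_ch, f, out_ch, k):
    conv1 = in_ch * f * k * k + f          # weights + biases, layer 1
    conv2 = f * out_ch * k * k + out_ch    # weights + biases, layer 2
    return conv1 + conv2

n = cnn_params(in_ch=2, f=8, out_ch=2, k=5)  # O(1000), as claimed
```

By contrast, an 8-10 layer CNN with wide hidden channels easily reaches hundreds of thousands of parameters, which is the two-orders-of-magnitude gap the abstract refers to.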
Current issues of the management of socio-economic systems in terms of globalization challenges
The authors of the scientific monograph have concluded that the management of socio-economic systems under global challenges requires mechanisms to ensure security, optimise the use of resource potential, increase competitiveness, and provide state support to economic entities. The basic research focuses on the assessment of economic entities under global challenges, and on analysis of the financial system, migration flows, logistics and product exports, and territorial development. The research results have been implemented in different decision-making models in the context of global challenges, strategic planning, financial and food security, education management, information technology and innovation. The results of the study can be used in developing directions, programmes and strategies for the sustainable development of economic entities and regions, in increasing the competitiveness of products and services, and in decision-making at the level of ministries and agencies that regulate the management of socio-economic systems. The results can also be used by students and young scientists in the educational process and in scientific research on the management of socio-economic systems under global challenges.
Memory-constrained scheduling by taming data locality in a task-based programming model
A now-classical way of meeting the increasing demand for computing speed by HPC applications is the use of GPUs and/or other accelerators. Such accelerators have their own memory, which is usually quite limited, and are connected to the main memory through a bus with bounded bandwidth. Thus, particular care should be devoted to data locality in order to avoid unnecessary data movements. Task-based runtime schedulers have emerged as a convenient and efficient way to use such heterogeneous platforms. When processing an application, the scheduler knows all tasks available for processing on a GPU, as well as their input data dependencies. Hence, it is possible to produce a task processing order that reduces the total processing time through three objectives: minimizing data transfers, overlapping transfers and computation, and optimizing the eviction of previously-loaded data. In this paper, we focus on how to schedule tasks that share some of their input data (but are otherwise independent) on a single GPU. We provide a formal model of the problem, exhibit an optimal eviction strategy, and show that ordering tasks to minimize data movement is NP-complete. We review and adapt existing ordering strategies to this problem, and propose a new one based on task aggregation. We prove that the underlying problem of this new strategy is NP-complete, and prove the reasonable complexity of our proposed heuristic. These strategies have been implemented in the StarPU runtime system. We present their performance on tasks from tiled 2D and 3D matrix products, Cholesky factorization, randomized task orders, randomized data pairs from the 2D matrix product, as well as a sparse matrix product. We introduce a visual way to understand these performances, and lower bounds on the number of data loads for the 2D and 3D matrix products. Our experiments demonstrate that using our new strategy together with the optimal eviction policy reduces the amount of data movement as well as the total processing time.
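The eviction side of the problem can be sketched with a Belady-style "furthest next use" policy of the kind such papers analyze (a simplified illustration with unit-size data and a fixed task order, not the paper's exact formal model): when memory is full, evict the resident datum whose next use is furthest in the future.

```python
# Sketch of furthest-next-use eviction for tasks sharing input data on
# one accelerator. Simplified: unit-size data, precomputed task order.
def count_loads(task_inputs, capacity):
    loads, mem = 0, set()
    for i, inputs in enumerate(task_inputs):
        for d in inputs:
            if d in mem:
                continue
            loads += 1
            if len(mem) >= capacity:
                # next use of each resident datum in the remaining tasks
                def next_use(x):
                    for j in range(i, len(task_inputs)):
                        if x in task_inputs[j]:
                            return j
                    return float("inf")
                mem.remove(max(mem, key=next_use))  # evict furthest use
            mem.add(d)
    return loads

# tiled matrix-product flavour: task (i, j) reads row block Ai, column block Bj
tasks = [(f"A{i}", f"B{j}") for i in range(3) for j in range(3)]
loads = count_loads(tasks, capacity=4)
```

For this row-major task order and a capacity of 4 blocks, every datum is loaded exactly once (6 loads), matching the trivial lower bound of one load per distinct datum.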
Algebraic Temporal Blocking for Sparse Iterative Solvers on Multi-Core CPUs
Sparse linear iterative solvers are essential for many large-scale
simulations. Much of the runtime of these solvers is often spent in the
implicit evaluation of matrix polynomials via a sequence of sparse
matrix-vector products. A variety of approaches has been proposed to make these
polynomial evaluations explicit (i.e., fix the coefficients), e.g., polynomial
preconditioners or s-step Krylov methods. Furthermore, it is nowadays a popular
practice to approximate triangular solves by a matrix polynomial to increase
parallelism. Such algorithms allow the polynomial to be evaluated using a so-called
matrix power kernel (MPK), which computes the product between a power of a
sparse matrix A and a dense vector x, or a related operation. Recently we have
shown that using the level-based formulation of sparse matrix-vector
multiplications in the Recursive Algebraic Coloring Engine (RACE) framework, we
can perform temporal cache blocking of MPK to increase its performance. In this
work, we demonstrate the application of this cache-blocking optimization in
sparse iterative solvers.
By integrating the RACE library into the Trilinos framework, we demonstrate
the speedups achieved in (preconditioned) s-step GMRES, polynomial
preconditioners, and algebraic multigrid (AMG). For MPK-dominated algorithms we
achieve speedups of up to 3x on modern multi-core compute nodes. For algorithms
with moderate contributions from subspace orthogonalization, the gain reduces
significantly, which is often caused by the insufficient quality of the
orthogonalization routines. Finally, we showcase the application of
RACE-accelerated solvers in a real-world wind turbine simulation (Nalu-Wind)
and highlight the new opportunities and perspectives opened up by RACE as a
cache-blocking technique for MPK-enabled sparse solvers.
Comment: 25 pages, 11 figures, 3 tables
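The reference semantics of a matrix power kernel can be sketched as repeated CSR sparse matrix-vector products (our minimal illustration, not RACE's blocked implementation): RACE's contribution is reordering this computation over level sets so that cached parts of A and x are reused across powers, whereas the naive loop below streams A once per power.

```python
# Naive matrix power kernel (MPK): compute [x, Ax, ..., A^p x] for a
# CSR matrix. Reference semantics only; RACE applies temporal cache
# blocking over level sets instead of this power-by-power sweep.
def mpk(row_ptr, col_idx, vals, x, p):
    powers = [list(x)]
    for _ in range(p):
        y = [0.0] * len(x)
        for i in range(len(x)):           # one CSR SpMV
            for k in range(row_ptr[i], row_ptr[i + 1]):
                y[i] += vals[k] * x[col_idx[k]]
        powers.append(y)
        x = y
    return powers

# 4x4 tridiagonal (-1, 2, -1) stencil in CSR form
row_ptr = [0, 2, 5, 8, 10]
col_idx = [0, 1, 0, 1, 2, 1, 2, 3, 2, 3]
vals = [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0]
result = mpk(row_ptr, col_idx, vals, [1.0, 1.0, 1.0, 1.0], 2)
```

s-step Krylov methods and polynomial preconditioners consume exactly such sequences of powers, which is why accelerating the MPK speeds up the whole solver.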