Adaptive BDDC in Three Dimensions
The adaptive BDDC method is extended to the selection of face constraints in
three dimensions. A new implementation of the BDDC method is presented based on
a global formulation without an explicit coarse problem, with massive
parallelism provided by a multifrontal solver. Constraints are implemented by a
projection and sparsity of the projected operator is preserved by a generalized
change of variables. The effectiveness of the method is illustrated on several
engineering problems.
Comment: 28 pages, 9 figures, 9 tables
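As a generic illustration of implementing constraints by a projection (a toy sketch, not the paper's BDDC formulation), consider a hypothetical small SPD operator with a single linear constraint:

```python
import numpy as np

# Hypothetical small SPD operator A and one linear constraint C u = 0
# (a zero-sum condition); a generic sketch, not the paper's formulation.
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 4.0, 1.0],
              [0.0, 1.0, 4.0]])
C = np.array([[1.0, 1.0, 1.0]])

# Orthogonal projection onto the nullspace of C:
# P = I - C^T (C C^T)^{-1} C
P = np.eye(3) - C.T @ np.linalg.solve(C @ C.T, C)

# P A P is the projected operator; any projected vector satisfies
# the constraint exactly.
u = np.array([1.0, -2.0, 3.0])
v = P @ u
```

In the paper's setting, a generalized change of variables additionally keeps the projected operator sparse; this small dense toy ignores the sparsity question entirely.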
A GPU-accelerated package for simulation of flow in nanoporous source rocks with many-body dissipative particle dynamics
Mesoscopic simulations of hydrocarbon flow in source shales are challenging,
in part due to the heterogeneous shale pores with sizes ranging from a few
nanometers to a few micrometers. Additionally, the sub-continuum fluid-fluid
and fluid-solid interactions in nano- to micro-scale shale pores, which are
physically and chemically complex, must be captured. To address these
challenges, we present a GPU-accelerated package for simulating flow in
nano- to micro-pore networks with a many-body dissipative particle dynamics
(mDPD) mesoscale model. Based on a fully distributed parallel paradigm, the
code offloads all intensive workloads onto GPUs. Other advancements, such as
smart particle packing and no-slip boundary conditions in complex pore
geometries, are also implemented for constructing and simulating realistic
shale pores from 3D nanometer-resolution stack images. Our code is
validated for accuracy and compared against the CPU counterpart for speedup. In
our benchmark tests, the code delivers nearly perfect strong scaling and weak
scaling (with up to 512 million particles) on up to 512 K20X GPUs on Oak Ridge
National Laboratory's (ORNL) Titan supercomputer. Moreover, a single-GPU
benchmark on ORNL's SummitDev and IBM's AC922 suggests that the host-to-device
NVLink can boost performance over PCIe by a remarkable 40%. Lastly, we
demonstrate, through a flow simulation in realistic shale pores, that the CPU
counterpart requires 840 Power9 cores to rival the performance delivered by our
package with four V100 GPUs on ORNL's Summit architecture. This simulation
package enables quick-turnaround and high-throughput mesoscopic numerical
simulations for investigating complex flow phenomena in nano- to micro-porous
rocks with realistic pore geometries.
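For orientation, the conservative pair force at the heart of an mDPD model combines an attractive density-independent term with a repulsive density-dependent one. The sketch below uses illustrative parameter values, not the package's calibrated shale parameters:

```python
import numpy as np

# mDPD conservative pair force:
#   F_ij = [A * w(r, rc) + B * (rho_i + rho_j) * w(r, rd)] * e_ij
# with the linear weight w(r, rcut) = 1 - r/rcut (zero beyond rcut).
# A < 0 gives attraction; the density-dependent B term gives repulsion.
# Parameter values are illustrative, not calibrated shale parameters.
def weight(r, rcut):
    return max(0.0, 1.0 - r / rcut)

def mdpd_conservative_force(ri, rj, rho_i, rho_j,
                            A=-40.0, B=25.0, rc=1.0, rd=0.75):
    rij = ri - rj
    r = np.linalg.norm(rij)
    if r == 0.0 or r >= rc:
        return np.zeros(3)
    e = rij / r
    return (A * weight(r, rc) + B * (rho_i + rho_j) * weight(r, rd)) * e

ri, rj = np.zeros(3), np.array([0.5, 0.0, 0.0])
f = mdpd_conservative_force(ri, rj, rho_i=6.0, rho_j=6.0)
```

A production code evaluates this force over neighbour lists for millions of pairs per step, which is the workload the package offloads to GPUs.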
Generating and auto-tuning parallel stencil codes
In this thesis, we present a software framework, Patus, which generates high performance stencil codes for different types of hardware platforms, including current multicore CPU and graphics processing unit architectures. The ultimate goals of the framework are productivity, portability (of both the code and performance), and achieving high performance on the target platform.
A stencil computation updates every grid point in a structured grid based on the values of its neighboring points. This class of computations occurs frequently in scientific and general purpose computing (e.g., in partial differential equation solvers or in image processing), justifying the focus on this kind of computation.
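As a concrete instance of this class, a 5-point Jacobi-style stencil (a standard example; written in plain NumPy here rather than as Patus-generated code) can be sketched as:

```python
import numpy as np

# 5-point Jacobi stencil: each interior grid point is replaced by the
# average of its four neighbours (a toy stand-in for generated code).
def jacobi_step(u):
    v = u.copy()
    v[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                            u[1:-1, :-2] + u[1:-1, 2:])
    return v

u = np.zeros((6, 6))
u[0, :] = 1.0              # fixed "hot" boundary on one side
for _ in range(100):       # sweep repeatedly; boundaries stay fixed
    u = jacobi_step(u)
```

The same neighbourhood pattern, applied at every grid point, is what a stencil DSL captures once and then maps to each target platform.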
The proposed key ingredients to achieve the goals of productivity, portability, and performance are domain specific languages (DSLs) and the auto-tuning methodology.
The Patus stencil specification DSL allows the programmer to express a stencil computation concisely, independently of hardware architecture-specific details. Thus, it increases programmer productivity by relieving him or her of low-level programming model issues and of manually applying hardware platform-specific code optimization techniques. The use of domain-specific languages also implies code reusability: once implemented, the same stencil specification can be reused on different
hardware platforms, i.e., the specification code is portable across hardware architectures. Constructing the language to be geared towards a special purpose makes it amenable to more aggressive optimizations and therefore to potentially higher performance.
Auto-tuning provides performance and performance portability by automatically adapting implementation-specific parameters to the characteristics of the hardware on which the code will run. By automating the process of parameter tuning (which essentially amounts to solving an integer programming problem whose objective function is the code's performance as a function of the parameter configuration), the system can also be used more productively than if the programmer had to fine-tune the code manually.
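The tuning loop described above can be sketched as an exhaustive search over a small integer parameter space. The blocked-transpose kernel and the candidate block sizes below are illustrative stand-ins, not Patus's actual search space or benchmarking harness:

```python
import itertools
import timeit
import numpy as np

# Blocked matrix transpose: the block sizes (bi, bj) are the tunable
# integer parameters (illustrative; not Patus's actual search space).
def blocked_transpose(a, bi, bj):
    n, m = a.shape
    out = np.empty((m, n), dtype=a.dtype)
    for i in range(0, n, bi):
        for j in range(0, m, bj):
            out[j:j+bj, i:i+bi] = a[i:i+bi, j:j+bj].T
    return out

# Exhaustive auto-tuning: benchmark every configuration, keep the fastest.
a = np.random.rand(256, 256)
best = min(
    itertools.product([16, 32, 64], repeat=2),
    key=lambda p: timeit.timeit(lambda: blocked_transpose(a, *p), number=3),
)
```

Real auto-tuners replace the exhaustive enumeration with search heuristics once the parameter space grows beyond a handful of dimensions.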
We show performance results for a variety of stencils, for which Patus was used to generate the corresponding implementations. The selection includes stencils taken from two real-world applications: a simulation of the temperature within the human body during hyperthermia cancer treatment and a seismic application. These examples demonstrate the framework's flexibility and ability to produce high performance code.
Mapping numerical software onto distributed memory parallel systems
The aim of this thesis is to further the use of parallel computers, in particular distributed memory systems, by providing strategies for parallelisation and developing the core components of tools to aid the porting of scalar software. The ported code must not only efficiently exploit available parallel processing speed and distributed memory, but also enable existing users of the scalar code to use the parallel version with identical inputs, and allow maintenance to be performed by the scalar code author in conjunction with the parallel code.
The data partition strategy has been used to parallelise an in-house solidification modelling code where all requirements for the parallel software were successfully met. To confirm the success of this parallelisation strategy, a much sterner test was used, parallelising the HARWELL-FLOW3D fluid flow package. The performance results of the parallel version clearly vindicate the conclusions of the first example. Speedup efficiencies of around 80 percent have been achieved on fifty processors for sizable models. In both these tests, the alterations to the code were fairly minor, maintaining the structure and style of the original scalar code which can easily be recognised by its original author.
The alterations made to these codes indicated the potential for parallelising tools, since the alterations were fairly minor and usually mechanical in nature. The current generation of parallelising compilers relies heavily on heuristic guidance in parallel code generation and in other decisions that may be better made by a human. As a result, the code they produce will almost certainly be inferior to manually produced code. Also, in order not to sacrifice parallel code quality when using tools, the scalar code analysis used in parallelising compilers to identify inherent parallelism in an application code has been extended to eliminate conservatively assumed dependencies, since these dependencies can greatly inhibit parallelisation.
Extra information has been extracted both from control flow and from processing symbolic information. The tests devised to utilise this information enable the non-existence of a significant number of previously assumed dependencies to be proved. In some cases, the number of true dependencies has been more than halved.
The dependence graph produced is of sufficient quality to greatly aid parallelisation: user interaction and interpretation, parallelism detection, and the validation of code transformations are all less inhibited by assumed dependencies. The use of tools rather than the black-box approach removes the handicaps associated with using heuristic methods, if any relevant heuristic methods exist.
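The kind of conservatively assumed dependence that such extended analysis removes can be illustrated with a toy index-set check (a symbolic test would reason about the expressions 2i and 2i+1 directly rather than enumerating them):

```python
# A loop writing a[2*i] and reading a[2*i+1] carries no true dependence,
# but a conservative compiler may assume one. Enumerating the two index
# sets proves they never overlap, so the iterations are independent.
n = 100
writes = {2 * i for i in range(n)}       # even indices only
reads = {2 * i + 1 for i in range(n)}    # odd indices only
independent = writes.isdisjoint(reads)
```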
Exploiting spatial symmetries for solving Poisson's equation
This paper presents a strategy to accelerate virtually any Poisson solver by taking advantage of s spatial reflection symmetries. More precisely, we have proved the existence of an inexpensive block diagonalisation that transforms the original Poisson equation into a set of 2^s fully decoupled subsystems, which are then solved concurrently. This block diagonalisation is identical regardless of the mesh connectivity (structured or unstructured) and the geometric complexity of the problem, and therefore applies to a wide range of academic and industrial configurations. In fact, it simplifies the task of discretising complex geometries, since it only requires meshing a portion of the domain that is then mirrored implicitly by the symmetries' hyperplanes. Thus, the resulting meshes naturally inherit the exploited symmetries, and their memory footprint becomes 2^s times smaller. Thanks to the subsystems' better spectral properties, iterative solvers converge significantly faster. Additionally, imposing an adequate ordering of the grid points allows reducing the operators' footprint and replacing the standard sparse matrix-vector products with sparse matrix-matrix products, a higher arithmetic intensity kernel. As a result, matrix multiplications are accelerated, and massive simulations become more affordable. Finally, we include numerical experiments based on a turbulent flow simulation in which state-of-the-art solvers exploit a varying number of symmetries. On the one hand, algebraic multigrid and preconditioned Krylov subspace methods require up to 23% and 72% fewer iterations, resulting in up to 1.7x and 5.6x overall speedups, respectively.
On the other, sparse direct solvers' memory footprint, setup and solution costs are reduced by up to 48%, 58% and 46%, respectively.
This work has been financially supported by two competitive R+D projects: RETOtwin (PDC2021-120970-I00), given by MCIN/AEI/10.13039/501100011033 and European Union Next GenerationEU/PRTR, and FusionCAT (001-P-001722), given by Generalitat de Catalunya RIS3CAT-FEDER. Àdel Alsalti-Baldellou has also been supported by the predoctoral grants DIN2018-010061 and 2019-DI-90, given by MCIN/AEI/10.13039/501100011033 and the Catalan Agency for Management of University and Research Grants (AGAUR), respectively.
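The core idea for a single reflection symmetry (s = 1) can be sketched in a few lines: if the discrete Poisson operator commutes with the reflection, splitting the right-hand side into its symmetric and antisymmetric parts yields 2^1 = 2 decoupled solves. The toy 1D Laplacian below keeps both solves at full size for clarity, whereas the actual method restricts each subsystem to half the mesh:

```python
import numpy as np

# Toy 1D Poisson matrix and its reflection operator (s = 1 symmetry).
n = 8
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
R = np.eye(n)[::-1]                      # mirror reflection
assert np.allclose(A @ R, R @ A)         # A commutes with R

# Split the right-hand side into symmetric and antisymmetric parts:
# the two resulting solves are fully decoupled and sum to the solution.
f = np.random.rand(n)
fs, fa = 0.5 * (f + R @ f), 0.5 * (f - R @ f)
u = np.linalg.solve(A, fs) + np.linalg.solve(A, fa)
```

With s commuting reflections, repeating the split along each symmetry axis produces the 2^s independent subsystems the abstract describes.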
Heterogeneous CPU/GPU co-execution of CFD simulations on the POWER9 architecture: Application to airplane aerodynamics
High fidelity Computational Fluid Dynamics simulations are generally
associated with large computing requirements, which become progressively more
acute with each new generation of supercomputers. However, significant research
efforts are required to unlock the computing power of leading-edge systems,
currently referred to as pre-Exascale systems, based on increasingly complex
architectures. In this paper, we present the approach implemented in the
computational mechanics code Alya. We describe in detail the parallelization
strategy implemented to fully exploit the different levels of parallelism,
together with a novel co-execution method for the efficient utilization of
heterogeneous CPU/GPU architectures. The latter is based on a multi-code
co-execution approach with a dynamic load balancing mechanism. The assessment
of the performance of all the proposed strategies has been carried out for
airplane simulations on the POWER9 architecture accelerated with NVIDIA Volta
V100 GPUs.
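The dynamic load-balancing mechanism mentioned above can be illustrated with a minimal proportional-split sketch. The timing figures are synthetic stand-ins for the per-chunk CPU and GPU measurements a co-execution runtime would collect; this is not Alya's actual algorithm:

```python
# Proportional work split between a "CPU" and a "GPU" worker based on
# their last measured per-unit times (synthetic values; not Alya's
# actual load-balancing algorithm).
def rebalance(total_work, t_cpu, t_gpu):
    """Return (cpu_share, gpu_share) so both workers finish together."""
    rate_cpu, rate_gpu = 1.0 / t_cpu, 1.0 / t_gpu
    cpu_share = round(total_work * rate_cpu / (rate_cpu + rate_gpu))
    return cpu_share, total_work - cpu_share

# e.g. the GPU processed its previous chunk 4x faster per unit of work,
# so it receives 4x the elements in the next step
cpu, gpu = rebalance(1000, t_cpu=4.0, t_gpu=1.0)
```

Repeating the measure-and-resplit step each iteration lets the partition track runtime effects such as cache behaviour or clock throttling.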