Adaptive BDDC in Three Dimensions
The adaptive BDDC method is extended to the selection of face constraints in
three dimensions. A new implementation of the BDDC method is presented based on
a global formulation without an explicit coarse problem, with massive
parallelism provided by a multifrontal solver. Constraints are implemented by a
projection and sparsity of the projected operator is preserved by a generalized
change of variables. The effectiveness of the method is illustrated on several
engineering problems.
Comment: 28 pages, 9 figures, 9 tables
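As a generic illustration of implementing constraints by a projection (a toy sketch, not the paper's BDDC formulation), consider a hypothetical small SPD operator with a single linear constraint:

```python
import numpy as np

# Hypothetical small SPD operator A and one linear constraint C u = 0
# (a zero-sum condition); a generic sketch, not the paper's formulation.
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 4.0, 1.0],
              [0.0, 1.0, 4.0]])
C = np.array([[1.0, 1.0, 1.0]])

# Orthogonal projection onto the nullspace of C:
# P = I - C^T (C C^T)^{-1} C
P = np.eye(3) - C.T @ np.linalg.solve(C @ C.T, C)

# P A P is the projected operator; any projected vector satisfies
# the constraint exactly.
u = np.array([1.0, -2.0, 3.0])
v = P @ u
```

In the paper's setting, a generalized change of variables additionally keeps the projected operator sparse; this small dense toy ignores the sparsity question entirely.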
A GPU-accelerated package for simulation of flow in nanoporous source rocks with many-body dissipative particle dynamics
Mesoscopic simulations of hydrocarbon flow in source shales are challenging,
in part due to the heterogeneous shale pores with sizes ranging from a few
nanometers to a few micrometers. Additionally, the sub-continuum fluid-fluid
and fluid-solid interactions in nano- to micro-scale shale pores, which are
physically and chemically complex, must be captured. To address these
challenges, we present a GPU-accelerated package for simulating flow in
nano- to micro-pore networks with a many-body dissipative particle dynamics
(mDPD) mesoscale model. Based on a fully distributed parallel paradigm, the
code offloads all intensive workloads onto GPUs. Other advancements, such as
smart particle packing and no-slip boundary conditions in complex pore
geometries, are also implemented for constructing and simulating realistic
shale pores from 3D nanometer-resolution stack images. Our code is
validated for accuracy and compared against the CPU counterpart for speedup. In
our benchmark tests, the code delivers nearly perfect strong scaling and weak
scaling (with up to 512 million particles) on up to 512 K20X GPUs on Oak Ridge
National Laboratory's (ORNL) Titan supercomputer. Moreover, a single-GPU
benchmark on ORNL's SummitDev and IBM's AC922 suggests that the host-to-device
NVLink can boost performance over PCIe by a remarkable 40%. Lastly, we
demonstrate, through a flow simulation in realistic shale pores, that the CPU
counterpart requires 840 Power9 cores to rival the performance delivered by our
package with four V100 GPUs on ORNL's Summit architecture. This simulation
package enables quick-turnaround and high-throughput mesoscopic numerical
simulations for investigating complex flow phenomena in nano- to micro-porous
rocks with realistic pore geometries.
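For orientation, the conservative pair force at the heart of an mDPD model combines an attractive density-independent term with a repulsive density-dependent one. The sketch below uses illustrative parameter values, not the package's calibrated shale parameters:

```python
import numpy as np

# mDPD conservative pair force:
#   F_ij = [A * w(r, rc) + B * (rho_i + rho_j) * w(r, rd)] * e_ij
# with the linear weight w(r, rcut) = 1 - r/rcut (zero beyond rcut).
# A < 0 gives attraction; the density-dependent B term gives repulsion.
# Parameter values are illustrative, not calibrated shale parameters.
def weight(r, rcut):
    return max(0.0, 1.0 - r / rcut)

def mdpd_conservative_force(ri, rj, rho_i, rho_j,
                            A=-40.0, B=25.0, rc=1.0, rd=0.75):
    rij = ri - rj
    r = np.linalg.norm(rij)
    if r == 0.0 or r >= rc:
        return np.zeros(3)
    e = rij / r
    return (A * weight(r, rc) + B * (rho_i + rho_j) * weight(r, rd)) * e

ri, rj = np.zeros(3), np.array([0.5, 0.0, 0.0])
f = mdpd_conservative_force(ri, rj, rho_i=6.0, rho_j=6.0)
```

A production code evaluates this force over neighbour lists for millions of pairs per step, which is the workload the package offloads to GPUs.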
Generating and auto-tuning parallel stencil codes
In this thesis, we present a software framework, Patus, which generates high performance stencil codes for different types of hardware platforms, including current multicore CPU and graphics processing unit architectures. The ultimate goals of the framework are productivity, portability (of both the code and performance), and achieving high performance on the target platform.
A stencil computation updates every grid point in a structured grid based on the values of its neighboring points. This class of computations occurs frequently in scientific and general purpose computing (e.g., in partial differential equation solvers or in image processing), justifying the focus on this kind of computation.
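As a concrete instance of this class, a 5-point Jacobi-style stencil (a standard example; written in plain NumPy here rather than as Patus-generated code) can be sketched as:

```python
import numpy as np

# 5-point Jacobi stencil: each interior grid point is replaced by the
# average of its four neighbours (a toy stand-in for generated code).
def jacobi_step(u):
    v = u.copy()
    v[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                            u[1:-1, :-2] + u[1:-1, 2:])
    return v

u = np.zeros((6, 6))
u[0, :] = 1.0              # fixed "hot" boundary on one side
for _ in range(100):       # sweep repeatedly; boundaries stay fixed
    u = jacobi_step(u)
```

The same neighbourhood pattern, applied at every grid point, is what a stencil DSL captures once and then maps to each target platform.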
The proposed key ingredients to achieve the goals of productivity, portability, and performance are domain specific languages (DSLs) and the auto-tuning methodology.
The Patus stencil specification DSL allows the programmer to express a stencil computation concisely, independently of hardware architecture-specific details. Thus, it increases programmer productivity by relieving him or her of low-level programming model issues and of manually applying hardware platform-specific code optimization techniques. The use of domain-specific languages also implies code reusability: once implemented, the same stencil specification can be reused on different
hardware platforms, i.e., the specification code is portable across hardware architectures. Constructing the language to be geared towards a special purpose makes it amenable to more aggressive optimizations and therefore to potentially higher performance.
Auto-tuning provides performance and performance portability by automatically adapting implementation-specific parameters to the characteristics of the hardware on which the code will run. By automating the process of parameter tuning (which essentially amounts to solving an integer programming problem whose objective function is the code's performance as a function of the parameter configuration), the system can also be used more productively than if the programmer had to fine-tune the code manually.
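The tuning loop described above can be sketched as an exhaustive search over a small integer parameter space. The blocked-transpose kernel and the candidate block sizes below are illustrative stand-ins, not Patus's actual search space or benchmarking harness:

```python
import itertools
import timeit
import numpy as np

# Blocked matrix transpose: the block sizes (bi, bj) are the tunable
# integer parameters (illustrative; not Patus's actual search space).
def blocked_transpose(a, bi, bj):
    n, m = a.shape
    out = np.empty((m, n), dtype=a.dtype)
    for i in range(0, n, bi):
        for j in range(0, m, bj):
            out[j:j+bj, i:i+bi] = a[i:i+bi, j:j+bj].T
    return out

# Exhaustive auto-tuning: benchmark every configuration, keep the fastest.
a = np.random.rand(256, 256)
best = min(
    itertools.product([16, 32, 64], repeat=2),
    key=lambda p: timeit.timeit(lambda: blocked_transpose(a, *p), number=3),
)
```

Real auto-tuners replace the exhaustive enumeration with search heuristics once the parameter space grows beyond a handful of dimensions.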
We show performance results for a variety of stencils, for which Patus was used to generate the corresponding implementations. The selection includes stencils taken from two real-world applications: a simulation of the temperature within the human body during hyperthermia cancer treatment and a seismic application. These examples demonstrate the framework's flexibility and ability to produce high performance code.
Mapping numerical software onto distributed memory parallel systems
The aim of this thesis is to further the use of parallel computers, in particular distributed memory systems, by providing strategies for parallelisation and developing the core components of tools to aid the porting of scalar software. The ported code must not only efficiently exploit available parallel processing speed and distributed memory, but also enable existing users of the scalar code to use the parallel version with identical inputs, and allow maintenance to be performed by the scalar code author in conjunction with the parallel code.
The data partition strategy has been used to parallelise an in-house solidification modelling code where all requirements for the parallel software were successfully met. To confirm the success of this parallelisation strategy, a much sterner test was used, parallelising the HARWELL-FLOW3D fluid flow package. The performance results of the parallel version clearly vindicate the conclusions of the first example. Speedup efficiencies of around 80 percent have been achieved on fifty processors for sizable models. In both these tests, the alterations to the code were fairly minor, maintaining the structure and style of the original scalar code which can easily be recognised by its original author.
The alterations made to these codes indicated the potential for parallelising tools, since the alterations were fairly minor and usually mechanical in nature. The current generation of parallelising compilers relies heavily on heuristic guidance in parallel code generation and in other decisions that may be better made by a human. As a result, the code they produce will almost certainly be inferior to manually produced code. Also, in order not to sacrifice parallel code quality when using tools, the scalar code analysis used in parallelising compilers to identify inherent parallelism in an application code has been extended to eliminate conservatively assumed dependencies, since these dependencies can greatly inhibit parallelisation.
Extra information has been extracted both from control flow and from processing symbolic information. The tests devised to utilise this information enable the non-existence of a significant number of previously assumed dependencies to be proved. In some cases, the number of true dependencies has been more than halved.
The dependence graph produced is of sufficient quality to greatly aid parallelisation: user interaction and interpretation, parallelism detection, and the validation of code transformations are all less inhibited by assumed dependencies. The use of tools rather than the black-box approach removes the handicaps associated with using heuristic methods, if any relevant heuristic methods exist.
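The kind of conservatively assumed dependence that such extended analysis removes can be illustrated with a toy index-set check (a symbolic test would reason about the expressions 2i and 2i+1 directly rather than enumerating them):

```python
# A loop writing a[2*i] and reading a[2*i+1] carries no true dependence,
# but a conservative compiler may assume one. Enumerating the two index
# sets proves they never overlap, so the iterations are independent.
n = 100
writes = {2 * i for i in range(n)}       # even indices only
reads = {2 * i + 1 for i in range(n)}    # odd indices only
independent = writes.isdisjoint(reads)
```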
Exploiting spatial symmetries for solving Poisson's equation
This paper presents a strategy to accelerate virtually any Poisson solver by taking advantage of s spatial reflection symmetries. More precisely, we have proved the existence of an inexpensive block diagonalisation that transforms the original Poisson equation into a set of 2^s fully decoupled subsystems, which are then solved concurrently. This block diagonalisation is identical regardless of the mesh connectivity (structured or unstructured) and the geometric complexity of the problem, and therefore applies to a wide range of academic and industrial configurations. In fact, it simplifies the task of discretising complex geometries, since it only requires meshing a portion of the domain that is then mirrored implicitly by the symmetries' hyperplanes. Thus, the resulting meshes naturally inherit the exploited symmetries, and their memory footprint becomes 2^s times smaller. Thanks to the subsystems' better spectral properties, iterative solvers converge significantly faster. Additionally, imposing an adequate ordering of the grid points allows reducing the operators' footprint and replacing the standard sparse matrix-vector products with sparse matrix-matrix products, a higher arithmetic intensity kernel. As a result, matrix multiplications are accelerated, and massive simulations become more affordable. Finally, we include numerical experiments based on a turbulent flow simulation in which state-of-the-art solvers exploit a varying number of symmetries. On the one hand, algebraic multigrid and preconditioned Krylov subspace methods require up to 23% and 72% fewer iterations, resulting in up to 1.7x and 5.6x overall speedups, respectively.
On the other, sparse direct solvers' memory footprint, setup and solution costs are reduced by up to 48%, 58% and 46%, respectively.
This work has been financially supported by two competitive R+D projects: RETOtwin (PDC2021-120970-I00), given by MCIN/AEI/10.13039/501100011033 and European Union Next GenerationEU/PRTR, and FusionCAT (001-P-001722), given by Generalitat de Catalunya RIS3CAT-FEDER. Àdel Alsalti-Baldellou has also been supported by the predoctoral grants DIN2018-010061 and 2019-DI-90, given by MCIN/AEI/10.13039/501100011033 and the Catalan Agency for Management of University and Research Grants (AGAUR), respectively.
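The core idea for a single reflection symmetry (s = 1) can be sketched in a few lines: if the discrete Poisson operator commutes with the reflection, splitting the right-hand side into its symmetric and antisymmetric parts yields 2^1 = 2 decoupled solves. The toy 1D Laplacian below keeps both solves at full size for clarity, whereas the actual method restricts each subsystem to half the mesh:

```python
import numpy as np

# Toy 1D Poisson matrix and its reflection operator (s = 1 symmetry).
n = 8
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
R = np.eye(n)[::-1]                      # mirror reflection
assert np.allclose(A @ R, R @ A)         # A commutes with R

# Split the right-hand side into symmetric and antisymmetric parts:
# the two resulting solves are fully decoupled and sum to the solution.
f = np.random.rand(n)
fs, fa = 0.5 * (f + R @ f), 0.5 * (f - R @ f)
u = np.linalg.solve(A, fs) + np.linalg.solve(A, fa)
```

With s commuting reflections, repeating the split along each symmetry axis produces the 2^s independent subsystems the abstract describes.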
Heterogeneous CPU/GPU co-execution of CFD simulations on the POWER9 architecture: Application to airplane aerodynamics
High fidelity Computational Fluid Dynamics simulations are generally
associated with large computing requirements, which become progressively more
acute with each new generation of supercomputers. However, significant research
efforts are required to unlock the computing power of leading-edge systems,
currently referred to as pre-Exascale systems, based on increasingly complex
architectures. In this paper, we present the approach implemented in the
computational mechanics code Alya. We describe in detail the parallelization
strategy implemented to fully exploit the different levels of parallelism,
together with a novel co-execution method for the efficient utilization of
heterogeneous CPU/GPU architectures. The latter is based on a multi-code
co-execution approach with a dynamic load balancing mechanism. The assessment
of the performance of all the proposed strategies has been carried out for
airplane simulations on the POWER9 architecture accelerated with NVIDIA Volta
V100 GPUs.
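The dynamic load-balancing mechanism mentioned above can be illustrated with a minimal proportional-split sketch. The timing figures are synthetic stand-ins for the per-chunk CPU and GPU measurements a co-execution runtime would collect; this is not Alya's actual algorithm:

```python
# Proportional work split between a "CPU" and a "GPU" worker based on
# their last measured per-unit times (synthetic values; not Alya's
# actual load-balancing algorithm).
def rebalance(total_work, t_cpu, t_gpu):
    """Return (cpu_share, gpu_share) so both workers finish together."""
    rate_cpu, rate_gpu = 1.0 / t_cpu, 1.0 / t_gpu
    cpu_share = round(total_work * rate_cpu / (rate_cpu + rate_gpu))
    return cpu_share, total_work - cpu_share

# e.g. the GPU processed its previous chunk 4x faster per unit of work,
# so it receives 4x the elements in the next step
cpu, gpu = rebalance(1000, t_cpu=4.0, t_gpu=1.0)
```

Repeating the measure-and-resplit step each iteration lets the partition track runtime effects such as cache behaviour or clock throttling.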