Proceedings of the Eighth Annual Thermal and Fluids Analysis Workshop: Spacecraft Analysis and Design
This document contains papers presented at the Eighth Annual Thermal and Fluids Analysis Workshop (TFAWS) on Spacecraft Analysis and Design, hosted by the NASA/Johnson Space Center (JSC) on September 8-11, 1997, and held at the University of Houston - Clear Lake (UHCL) in the Bayou Building. The workshop was sponsored by NASA/JSC and comprised seminars and technical papers on fluid and thermal dynamics. Seminars covered GASP, SINDA, SINAPS Plus, TSS, and PHOENICS. Seventeen papers were presented.
A GPU-Accelerated Modern Fortran Version of the ECHO Code for Relativistic Magnetohydrodynamics
The numerical study of relativistic magnetohydrodynamics (MHD) plays a crucial role in high-energy astrophysics, but it is computationally demanding, given the complex physics involved (high Lorentz factor flows, extreme magnetization, curved spacetimes near compact objects) and the large variety of spatial scales needed to resolve turbulent motions. A great benefit comes from porting existing codes running on standard processors to GPU-based platforms. However, this usually requires a drastic rewriting of the original code, the use of specific languages like CUDA, and a complex analysis of data management and optimization of parallel processes. Here we describe the porting of the ECHO code for special and general relativistic MHD to accelerated devices, based simply on native Fortran built-in constructs, especially 'do concurrent' loops, a few OpenACC directives, and the straightforward data management provided by the Unified Memory option of NVIDIA compilers. Thanks to these very minor modifications to the original code, the new version of ECHO runs at least 16 times faster on GPU platforms compared to CPU-based ones. The chosen benchmark is the 3D propagation of a relativistic MHD Alfvén wave, for which strong and weak scaling tests performed on the LEONARDO pre-exascale supercomputer at CINECA are provided (using up to 256 nodes corresponding to 1024 GPUs, and over 14 billion cells). Finally, an example of a high-resolution relativistic MHD Alfvénic turbulence simulation is shown, demonstrating the potential of the new GPU-based version of ECHO for astrophysical plasmas. (Accepted for publication in Fluids, MDPI, 17 pages.)
On Tenth Order Central Spatial Schemes
This paper explores the performance of the tenth-order central spatial scheme and derives the accompanying energy-norm-stable summation-by-parts (SBP) boundary operators. The objective is to employ the resulting tenth-order spatial differencing with the stable SBP boundary operators as a base scheme in the framework of adaptive numerical dissipation control in the high-order multistep filter schemes of Yee et al. (1999), Yee and Sjögreen (2002, 2005, 2006, 2007), and Sjögreen and Yee (2004). These schemes were designed for multiscale turbulent flows including strong shock waves and combustion.
PDAC: A Data Parallel Algorithm for the Performance Analysis of Closed Queueing Networks
A parallel distribution analysis by chain (PDAC) algorithm is presented for the performance analysis of closed, multiple-class queueing networks. The PDAC algorithm uses data-parallel computation of the summation indices needed to compute the joint queue length probabilities. The computational cost of the PDAC algorithm is shown to be of polynomial order with a lower degree than the cost of the serial implementation of the DAC algorithm. Examples comparing the PDAC algorithm with the DAC algorithm illustrate its advantages and limitations.
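The abstract does not reproduce the DAC recurrences themselves. For a flavor of this family of normalizing-constant computations, here is a minimal Python sketch of Buzen's classic convolution algorithm for a single-class closed product-form network; note this is the standard serial baseline for such computations, not the PDAC or DAC algorithm, and the service demands and population below are illustrative.

```python
def buzen_g(demands, n_customers):
    """Buzen's convolution algorithm: returns [G(0), ..., G(N)], the
    normalization constants of a closed product-form network with
    load-independent stations, where station m has service demand demands[m]."""
    g = [1.0] + [0.0] * n_customers   # G for a network with no stations folded in
    for d in demands:                 # fold in one station at a time
        for n in range(1, n_customers + 1):
            g[n] += d * g[n - 1]      # in-place form of G(n,m) = G(n,m-1) + D_m G(n-1,m)
    return g

demands = [0.4, 0.3, 0.3]             # illustrative per-station service demands
g = buzen_g(demands, 5)
throughput = g[4] / g[5]              # system throughput X(N) = G(N-1) / G(N)
```

Once G is available, queue length probabilities and throughputs follow from simple ratios of the constants, which is also where the summation indices parallelized by PDAC arise.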
Methods for Multilevel Parallelism on GPU Clusters: Application to a Multigrid Accelerated Navier-Stokes Solver
Computational Fluid Dynamics (CFD) is an important field in high performance computing with numerous applications. Solving problems in thermal and fluid sciences demands enormous computing resources and has been one of the primary applications used on supercomputers and large clusters. Modern graphics processing units (GPUs) with many-core architectures have emerged as general-purpose parallel computing platforms that can accelerate simulation science applications substantially. While significant speedups have been obtained with single and multiple GPUs on a single workstation, large problems require more resources. Conventional clusters of central processing units (CPUs) are now being augmented with GPUs in each compute-node to tackle large problems.
The present research investigates methods of taking advantage of the multilevel parallelism in multi-node, multi-GPU systems to develop scalable simulation science software. The primary application developed is a cluster-ready GPU-accelerated incompressible Navier-Stokes flow solver that includes advanced numerical methods, including a geometric multigrid pressure Poisson solver. The research investigates multiple implementations to explore computation/communication overlapping methods, and explores methods for coarse-grain parallelism, including POSIX threads, MPI, and a hybrid OpenMP-MPI model. The application includes a number of usability features, including periodic VTK (Visualization Toolkit) output, a run-time configuration file, and flexible setup of obstacles to represent urban areas and complex terrain. Numerical features include a variety of time-stepping methods, buoyancy-driven flow, adaptive time-stepping, various iterative pressure solvers, and a new parallel 3D geometric multigrid solver. At each step, the project examines performance and scalability measures using the Lincoln Tesla cluster at the National Center for Supercomputing Applications (NCSA) and the Longhorn cluster at the Texas Advanced Computing Center (TACC). The results demonstrate that multi-GPU clusters can substantially accelerate computational fluid dynamics simulations.
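The dissertation's 3D parallel solver is not reproduced in the abstract; as a hint of how a geometric multigrid Poisson solver is organized, here is a minimal 1D V-cycle sketch in Python (weighted-Jacobi smoothing, full-weighting restriction, linear prolongation). The model problem -u'' = f, the grid sizes, and the sweep counts are illustrative assumptions, not the dissertation's implementation.

```python
import numpy as np

def smooth(u, f, h, sweeps, omega=2.0 / 3.0):
    # Weighted-Jacobi sweeps for -u'' = f with zero Dirichlet boundaries.
    for _ in range(sweeps):
        u[1:-1] = ((1.0 - omega) * u[1:-1]
                   + omega * 0.5 * (u[:-2] + u[2:] + h * h * f[1:-1]))
    return u

def residual(u, f, h):
    r = np.zeros_like(u)
    r[1:-1] = f[1:-1] - (2.0 * u[1:-1] - u[:-2] - u[2:]) / (h * h)
    return r

def restrict(r):
    # Full weighting onto the coarse grid (fine grid has 2^k + 1 points).
    rc = np.zeros((len(r) + 1) // 2)
    rc[1:-1] = 0.25 * r[1:-2:2] + 0.5 * r[2:-1:2] + 0.25 * r[3::2]
    return rc

def prolong(uc):
    # Linear interpolation back to the fine grid.
    u = np.zeros(2 * (len(uc) - 1) + 1)
    u[::2] = uc
    u[1::2] = 0.5 * (uc[:-1] + uc[1:])
    return u

def v_cycle(u, f, h, sweeps=3):
    if len(u) == 3:                   # one interior unknown: solve exactly
        u[1] = 0.5 * h * h * f[1]
        return u
    u = smooth(u, f, h, sweeps)       # pre-smoothing
    coarse_err = v_cycle(np.zeros((len(u) + 1) // 2),
                         restrict(residual(u, f, h)), 2.0 * h, sweeps)
    u += prolong(coarse_err)          # coarse-grid correction
    return smooth(u, f, h, sweeps)    # post-smoothing

# Model problem: -u'' = pi^2 sin(pi x), exact solution u = sin(pi x).
n = 129
x = np.linspace(0.0, 1.0, n)
f = np.pi**2 * np.sin(np.pi * x)
u = np.zeros(n)
for _ in range(10):
    u = v_cycle(u, f, x[1] - x[0])
```

The appeal of the method, and the reason it is worth parallelizing on GPU clusters, is that each V-cycle reduces all error frequencies at a cost linear in the number of unknowns.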
Development of Modelling Techniques for Pulsed Pressure Chemical Vapour Deposition (PP-CVD)
In this thesis, a numerical and theoretical investigation of the Pulsed Pressure Chemical Vapour Deposition (PP-CVD) process is presented. This process is a novel method for the deposition of thin films of materials from either liquid or gaseous precursors. PP-CVD operates in an unsteady manner whereby timed pulses of the precursor are injected into a continuously evacuated reactor volume.
A non-dimensional parameter indicating the extent of continuum breakdown under strong temporal gradients is developed. Experimental measurements, supplemented by basic continuum simulations, reveal that spatio-temporal breakdown of the continuum condition occurs within the reactor volume. This means that the use of continuum-equation-based solvers for modelling the flow field is inappropriate. In this thesis, appropriate methods are developed for modelling unsteady non-continuum flows, centred on the particle-based Direct Simulation Monte Carlo (DSMC) method.
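The thesis's specific temporal-gradient parameter is not given in this abstract. A commonly used spatial indicator of the same kind is the gradient-length local Knudsen number, Kn_GL = λ|∇Q|/Q, with continuum breakdown conventionally flagged where Kn_GL exceeds about 0.05. A minimal Python sketch, using a hard-sphere mean free path and an illustrative density profile:

```python
import numpy as np

def mean_free_path(n, diameter):
    """Hard-sphere mean free path for number density n [1/m^3]."""
    return 1.0 / (np.sqrt(2.0) * np.pi * diameter**2 * n)

def knudsen_gl(q, dx, lam):
    """Gradient-length local Knudsen number: Kn = lam * |dq/dx| / q."""
    return lam * np.abs(np.gradient(q, dx)) / q

x = np.linspace(0.0, 1.0, 201)                 # position [m]
n = 1e20 * np.exp(-5.0 * x)                    # illustrative decaying density [1/m^3]
kn = knudsen_gl(n, x[1] - x[0], mean_free_path(n, 4.17e-10))  # argon diameter
breakdown = kn > 0.05                          # conventional breakdown threshold
```

Wherever `breakdown` is true, Navier-Stokes-level closures become unreliable and a particle method such as DSMC is needed instead.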
As a first step, a basic particle tracking method and a single-processor DSMC code are used to investigate the physical mechanisms behind the high precursor conversion efficiency and deposition uniformity observed in experimental reactors. This investigation reveals that soon after the completion of the PP-CVD injection phase, the precursor particles have an approximately uniform distribution within the reactor volume. The particles then simply diffuse to the substrate during the pump-down phase, during which the rate of diffusion greatly exceeds the rate at which particles can be removed from the reactor. Higher precursor conversion efficiency was found to correlate with smaller carrier-gas molecules and moderate reactor peak pressure.
An unsteady sampling routine for a general parallel DSMC method called PDSC, allowing the simulation of time-dependent flow problems in the near-continuum range, is then developed in detail. Nearest-neighbour collision routines are also implemented and verified for this code.
A post-processing procedure called the DSMC Rapid Ensemble Averaging Method (DREAM) is developed to reduce the statistical scatter in the results while minimising both memory and simulation time. This method builds an ensemble average of repeated runs over a small number of sampling intervals prior to the sampling point of interest by restarting the flow using either a Maxwellian distribution based on macroscopic properties for near-equilibrium flows (DREAM-I) or the instantaneous particle data output by the original unsteady sampling of PDSC for strongly non-equilibrium flows (DREAM-II). The method is validated by simulating shock tube flow and the development of simple Couette flow. Unsteady PDSC is found to accurately predict the flow field in both cases with significantly reduced run-times over the single-processor code, and DREAM greatly reduces the statistical scatter in the results while maintaining accurate particle velocity distributions. Verification simulations are conducted involving the interaction of shocks over wedges, and a benchmark study against another DSMC code is conducted.
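The DREAM-I restart-and-average idea can be sketched as follows: sample a Maxwellian from macroscopic properties, repeat a short run from independent restarts, and average the sampled moment. The gas properties and run counts are illustrative, and this is a schematic of the idea rather than the PDSC/DREAM implementation.

```python
import numpy as np

KB = 1.380649e-23  # Boltzmann constant [J/K]

def maxwellian_velocities(n_particles, temperature, mass, rng):
    """Maxwellian restart: each velocity component is Gaussian with
    variance kB*T/m, matching the local macroscopic temperature."""
    sigma = np.sqrt(KB * temperature / mass)
    return rng.normal(0.0, sigma, size=(n_particles, 3))

def ensemble_average(run_once, n_runs, rng):
    """DREAM-style scatter reduction: average one sampled moment over
    n_runs statistically independent restarts of the same interval."""
    return np.mean([run_once(rng) for _ in range(n_runs)], axis=0)

mass = 6.63e-26  # argon atom [kg]
rng = np.random.default_rng(0)

def sample_temperature(rng, n=2000, T=300.0):
    # One "run": restart from a Maxwellian and recover temperature from
    # the velocity variance, T = m <vx^2 + vy^2 + vz^2> / (3 kB).
    v = maxwellian_velocities(n, T, mass, rng)
    return mass * np.mean(np.sum(v**2, axis=1)) / (3.0 * KB)

T_avg = ensemble_average(sample_temperature, 50, rng)
```

Averaging over 50 restarts shrinks the scatter of the temperature estimate by roughly a factor of seven relative to a single run, which is the effect DREAM exploits without the memory cost of one long simulation.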
The unsteady PDSC routines are then used to simulate the PP-CVD injection phase. These simulations reveal the complex flow phenomena present during this stage. The initial expansion is highly unsteady; however, a quasi-steady jet structure forms within the reactor after this initial stage. The simulations give additional evidence that the collapse of the jet at the end of the injection phase results in an approximately uniform distribution of precursor throughout the reactor volume.
Advanced modelling methods and the future work required for development of the PP-CVD method are then proposed. These methods will allow all reactor configurations to be modelled while reducing the computational expense of the simulations.
A Programmable Streaming Framework for Extreme-Scale Scientific Visualizations
Emerging computational and acquisition technologies are empowering scientists to conduct simulations and experiments on an unprecedented scale. These advancements can push the frontiers of science and technology with groundbreaking discoveries. However, they also pose significant challenges to traditional scientific visualization workflows. Firstly, the data generated by modern scientific studies using these technologies tends to be extremely large and complex, often resulting in slow processing and rendering times; this demands visualization algorithms that can effectively scale with the size of the data. Secondly, state-of-the-art simulations and experiments produce data at extraordinary rates, complicating the task of generating valuable visualization results for scientists, so there is a pressing need for more adaptive and intelligent visualization workflows. Lastly, although new computer hardware and architectures can speed up the visualization process, significant performance variations still exist among visualization algorithms due to differing design choices. As a result, optimizing algorithms to better leverage emerging hardware features for enhanced efficiency remains an ongoing necessity.
This dissertation addresses the aforementioned challenges by introducing a programmable streaming framework enhanced with implicit neural representation, designed for visualizing extreme-scale scientific data. Specifically, it unfolds three innovative methodologies.
Firstly, the framework offers a reactive and declarative programming language for streamlining image generation, layout and interaction creation, and I/O processes, eliminating the need for users to manually control all visualization parameters and procedures. This language enables scientists to define highly adaptive visualization workflows through high-level, rule-based grammars. The system then automatically optimizes the low-level implementation according to these specifications, facilitating the creation of more efficient visualization workflows with simpler coding.
Secondly, the framework features a scalable, hardware-accelerated streaming visualization system that allows visualization processes to run concurrently with I/O operations. This system not only achieves state-of-the-art scalability but also effectively manages complex, multi-resolution data structures. It delivers accurate rendering outcomes, reduces memory usage, and leverages emerging hardware capabilities more efficiently.
Finally, the framework integrates implicit neural representation (INR) techniques for data compression and interactive visualization. The use of INRs significantly reduces data size while preserving high-frequency details. Additionally, it enables direct access to spatial locations at any desired resolution, obviating the need for decompression or interpolation.
In summary, this dissertation addresses long-standing challenges inherent in extreme-scale scientific visualization by introducing novel designs and methodologies. The presented framework not only enables more efficient and adaptive visualization workflows but also leverages the latest hardware acceleration and data compression techniques. The implications of these advancements extend beyond mere technical improvements; they pave the way for deeper insights and discoveries across a broad spectrum of scientific studies. This research therefore represents a significant leap forward, with the potential to transform the landscape of scientific visualization.
Performance and Energy Efficiency Optimization in Massively Parallel Systems
Heterogeneous systems are becoming increasingly relevant, due to their performance and energy efficiency capabilities, being present in all types of computing platforms, from embedded devices and servers to HPC nodes in large data centers. Their complexity means that they are usually used under the task paradigm and the host-device programming model. This strongly penalizes accelerator utilization and system energy consumption, as well as making it difficult to adapt applications.
Co-execution allows all devices to simultaneously compute the same problem, cooperating to consume less time and energy. However, programmers must handle all device management, workload distribution and code portability between systems, significantly complicating their programming.
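Neither runtime's API is shown in this abstract; the core scheduling idea behind co-execution, splitting one iteration space across devices in proportion to measured throughput, can be sketched as follows (the device throughputs and problem size are made up for illustration):

```python
def partition(total_items, throughputs):
    """Static co-execution split: give each device a share of the iteration
    space proportional to its measured throughput (items/s). Any rounding
    remainder goes to the fastest device so every item is assigned."""
    total_speed = sum(throughputs)
    chunks = [int(total_items * t / total_speed) for t in throughputs]
    fastest = max(range(len(throughputs)), key=throughputs.__getitem__)
    chunks[fastest] += total_items - sum(chunks)
    return chunks

# Illustrative: a GPU measured at 90 items/s and a CPU at 30 items/s.
chunks = partition(1_000_000, [90.0, 30.0])
```

Dynamic schedulers refine this by re-measuring throughput as the run progresses and re-partitioning the remaining work, which is the kind of adaptive strategy the thesis's runtimes automate.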
This thesis offers contributions to improve performance and energy efficiency in these massively parallel systems. The proposals address the following generally conflicting objectives: usability and programmability are improved, while ensuring enhanced system abstraction and extensibility, and at the same time performance, scalability and energy efficiency are increased. To achieve this, two runtime systems with completely different approaches are proposed.
EngineCL, focused on OpenCL and with a high-level API, provides an extensible modular system and favors maximum compatibility between all types of devices. Its versatility allows it to be adapted to environments for which it was not originally designed, including applications with time-constrained executions or molecular dynamics HPC simulators, such as the one used in an international research center.
Considering industrial trends and emphasizing professional applicability, CoexecutorRuntime is a flexible C++/SYCL-based system that provides co-execution support for oneAPI technology. This runtime brings programmers closer to the problem domain, enabling the exploitation of dynamic adaptive strategies that improve efficiency in all types of applications.
Funding: This PhD has been supported by the Spanish Ministry of Education (FPU16/03299 grant), the Spanish Science and Technology Commission under contracts TIN2016-76635-C2-2-R and PID2019-105660RB-C22.
This work has also been partially supported by the Mont-Blanc 3: European Scalable and Power Efficient HPC Platform based on Low-Power Embedded Technology project (G.A. No. 671697) from the European Union's Horizon 2020 Research and Innovation Programme (H2020 Programme). Some activities have also been funded by the Spanish Science and Technology Commission under contract TIN2016-81840-REDT (CAPAP-H6 network).
The "Integration II: Hybrid programming models" work of Chapter 4 has been partially performed under the Project HPC-EUROPA3 (INFRAIA-2016-1-730897), with the support of the EC Research Innovation Action under the H2020 Programme. In particular, the author gratefully acknowledges the support of the SPMT Department of the High Performance Computing Center Stuttgart (HLRS).