124 research outputs found

    High-performance tsunami modelling with modern GPU technology

    Get PDF
    PhD Thesis. Earthquake-induced tsunamis commonly propagate in the deep ocean as long waves and develop into sharp-fronted surges moving rapidly coastward, which may be effectively simulated by hydrodynamic models solving the nonlinear shallow water equations (SWEs). Tsunamis can cause substantial economic and human losses, which could be mitigated through early warning systems given efficient and accurate modelling. Most existing tsunami models require long simulation times for real-world applications. This thesis presents a graphics processing unit (GPU) accelerated finite volume hydrodynamic model using the compute unified device architecture (CUDA) for computationally efficient tsunami simulations. Compared with a standard PC, the model reduces run-time by a factor of more than 40. The validated model is used to reproduce the 2011 Japan tsunami. Two source models were tested, one based on tsunami waveform inversion and another using deep-ocean tsunameters. Vertical sea surface displacement is computed by the Okada model, assuming instantaneous sea-floor deformation. Both source models reproduce the wave propagation at offshore and nearshore gauges, but the tsunameter-based model better simulates the first wave amplitude. The effects of grid resolution (450-3600 m), slope limiters, and numerical accuracy are also investigated for the simulation of the 2011 Japan tsunami. Grid resolutions of 1-2 km perform well with a proper limiter; the Sweby limiter is optimal for coarser resolutions, recovering wave peaks better than minmod and remaining more numerically stable than Superbee. One hour of tsunami propagation can be predicted about 50 times faster on a regular low-cost PC-hosted GPU than on a single CPU. At 450 m resolution on a larger-memory server-hosted GPU, the speedup rises to roughly 70 times. Finally, two adaptive mesh refinement (AMR) techniques, a simplified dynamic adaptive grid on the CPU and a static adaptive grid on the GPU, are introduced to provide multi-scale simulations. Both reduce run-time roughly threefold while maintaining acceptable accuracy. The proposed computationally efficient tsunami model is expected to provide a new practical tool for tsunami modelling for different purposes, including real-time warning, evacuation planning, risk management, and city planning.
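    As a minimal sketch of the limiter comparison mentioned above: the three limiters are standard flux-limiter functions of the ratio r of successive solution gradients. The formulas below are the textbook definitions (with an assumed mid-range beta = 1.5 for Sweby), not the thesis's implementation.

```cpp
// Standard flux-limiter functions of the successive-gradient ratio r.
// Sweby's beta = 1.5 is an assumed mid-range choice: beta = 1 recovers
// minmod and beta = 2 recovers Superbee.
#include <algorithm>
#include <cstdio>

double minmod(double r)   { return std::max(0.0, std::min(1.0, r)); }
double superbee(double r) { return std::max({0.0, std::min(2.0 * r, 1.0), std::min(r, 2.0)}); }
double sweby(double r, double beta = 1.5) {
    return std::max({0.0, std::min(beta * r, 1.0), std::min(r, beta)});
}

int main() {
    for (double r : {0.25, 0.5, 1.0, 2.0, 4.0})
        std::printf("r=%4.2f  minmod=%4.2f  sweby=%4.2f  superbee=%4.2f\n",
                    r, minmod(r), sweby(r), superbee(r));
}
```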

    Automatic thread distribution for nested parallelism in OpenMP

    Full text link

    A Physics-Based Approach to Modeling Wildland Fire Spread Through Porous Fuel Beds

    Get PDF
    Wildfires are becoming increasingly erratic, at least in part because of climate change. Computational fluid dynamics (CFD) based models with the potential to simulate extreme behaviors are therefore gaining attention as a means of predicting such behavior to aid firefighting efforts. This dissertation describes a wildfire model based on the current understanding of wildfire physics. The model includes the physics of turbulence, inhomogeneous porous fuel beds, heat release, ignition, and firebrands. A discrete dynamical system for flow in porous media is derived and incorporated into the subgrid-scale model for synthetic-velocity large-eddy simulation (LES), and a general porosity-permeability model is derived and implemented to investigate transport properties of flow through porous fuel beds; both models also apply to other porous-media flows. Simulations of both grassland and forest fire spread are performed with an implicit LES code parallelized with OpenMP, and the parallel performance of the algorithms is presented and discussed. The model and numerical scheme reproduce previous wildfire experiments and simulations reasonably well while using coarser grids, and capture complicated subgrid-scale behaviors. It is concluded that this physics-based wildfire model can be a good learning tool for examining some of the more complex wildfire behaviors, and may become predictive in the near future.
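    To make the porosity-permeability idea concrete, the sketch below shows one common way such a model enters a momentum equation: a Darcy-Forchheimer drag term applied as an explicit source in the velocity update. This is a generic illustration with assumed values for the permeability K and form-drag coefficient cf, not the dissertation's model.

```cpp
// Generic Darcy-Forchheimer drag in a 1-D explicit velocity update.
// K and cf are assumed placeholder values, not from the dissertation.
#include <cmath>
#include <cstdio>

int main() {
    const double mu  = 1.8e-5;  // dynamic viscosity of air [Pa s]
    const double rho = 1.2;     // air density [kg/m^3]
    const double K   = 1e-7;    // permeability [m^2] (assumed)
    const double cf  = 0.1;     // Forchheimer coefficient [-] (assumed)
    const double dt  = 1e-4;    // time step [s]

    double u = 2.0;             // velocity inside the fuel bed [m/s]
    for (int n = 0; n < 5; ++n) {
        // Darcy (linear) + Forchheimer (quadratic) drag, expressed as an
        // acceleration after dividing the volumetric force by rho
        double drag = (mu / (rho * K)) * u + (cf / std::sqrt(K)) * std::fabs(u) * u;
        u -= dt * drag;
        std::printf("step %d: u = %.4f m/s\n", n + 1, u);
    }
}
```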

    Parallel Lagrangian particle transport: application to respiratory system airways

    Get PDF
    This thesis is focused on particle transport in the context of high-performance computing (HPC) in its widest sense, from numerical modeling to the physics involved, including parallelization and post-processing. The main goal is to obtain a general framework for understanding the requirements and characteristics of particle transport in the Lagrangian frame of reference. Although the idea is to provide a suitable model for any engineering application that involves particle transport simulation, this thesis focuses on the respiratory system. All the simulations are framed in this topic, including the benchmarks used for testing, verifying, and optimizing the results. Other applications, such as combustion, ocean residuals, or automotive flows, have been simulated by other researchers using the same numerical model proposed here, but are not included, to keep the project focused and the structure of this work easy to follow. Human airway and respiratory system simulations are of special interest for medical purposes. Indeed, human airways can differ significantly between individuals, which complicates the study of drug delivery efficiency, deposition of pollutant particles, and similar questions using classic in-vivo or in-vitro techniques. In other words, flow and deposition results may vary with the geometry of the patient, and simulations allow customized studies on specific geometries. With the help of new computational techniques, it may soon be possible to optimize nasal drug delivery, surgery, or other medical procedures for each individual patient through more personalized medicine. In summary, this thesis prioritizes numerical modeling, wide usability, performance, parallelization, and the study of the physics that affects particle transport. In addition, simulating the respiratory system should yield interesting biological and medical results, although these are interpreted here from a purely numerical point of view.
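    A minimal sketch of the Lagrangian point of view described above: each particle carries its own position and velocity and is advanced by integrating a drag law against the local fluid velocity. The Stokes-drag update below uses assumed placeholder values (particle size and density, fluid velocity), not parameters from the thesis.

```cpp
// 1-D Lagrangian particle update with Stokes drag and gravity.
// All physical values are assumed placeholders for illustration.
#include <cstdio>

int main() {
    const double rho_p = 1000.0;  // particle density [kg/m^3] (assumed)
    const double d_p   = 5e-6;    // particle diameter [m], ~inhalable aerosol
    const double mu    = 1.8e-5;  // air viscosity [Pa s]
    const double g     = -9.81;   // gravity [m/s^2]
    const double tau   = rho_p * d_p * d_p / (18.0 * mu);  // relaxation time

    double x = 0.0, v = 0.0;      // particle position and velocity
    const double u_f = 0.5;       // local fluid velocity [m/s] (assumed)
    const double dt  = 1e-5;      // time step, small relative to tau

    for (int n = 0; n < 1000; ++n) {
        v += dt * ((u_f - v) / tau + g);  // Stokes drag toward u_f, plus gravity
        x += dt * v;
    }
    std::printf("x = %.6f m, v = %.4f m/s (tau = %.2e s)\n", x, v, tau);
}
```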

    An Application-Based Performance Characterization of the Columbia Supercluster

    Get PDF
    Columbia is a 10,240-processor supercluster consisting of 20 Altix nodes with 512 processors each, currently ranked as the second-fastest computer in the world. In this paper, we present the performance characteristics of Columbia obtained on up to four computing nodes interconnected via the InfiniBand and/or NUMAlink4 communication fabrics. We evaluate floating-point performance, memory bandwidth, message-passing communication speeds, and compilers using a subset of the HPC Challenge benchmarks and some of the NAS Parallel Benchmarks, including the multi-zone versions. We present detailed performance results for three scientific applications of interest to NASA: one from molecular dynamics and two from computational fluid dynamics. Our results show that both NUMAlink4 and InfiniBand hold promise for scaling applications to large numbers of processors.
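    The memory bandwidth measurements mentioned above come from the HPC Challenge suite, which includes the STREAM kernels; a minimal OpenMP version of the triad kernel is sketched below, with an assumed array size rather than the paper's configuration.

```cpp
// STREAM-triad style bandwidth micro-benchmark (illustrative sizes).
// Build with OpenMP enabled (e.g. -fopenmp) to use all cores.
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const size_t n = 1 << 24;  // ~16.8M doubles per array (assumed size)
    std::vector<double> a(n, 1.0), b(n, 2.0), c(n, 0.5);
    const double s = 3.0;

    auto t0 = std::chrono::steady_clock::now();
    #pragma omp parallel for
    for (long i = 0; i < (long)n; ++i)
        a[i] = b[i] + s * c[i];  // triad: two loads + one store per element
    auto t1 = std::chrono::steady_clock::now();

    double sec = std::chrono::duration<double>(t1 - t0).count();
    double gb  = 3.0 * n * sizeof(double) / 1e9;  // bytes moved
    std::printf("triad: %.3f s, %.2f GB/s\n", sec, gb / sec);
}
```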

    Transformations of High-Level Synthesis Codes for High-Performance Computing

    Full text link
    Specialized hardware architectures promise a major step in performance and energy efficiency over the traditional load/store devices currently employed in large-scale computing systems. The adoption of high-level synthesis (HLS) from languages such as C/C++ and OpenCL has greatly increased programmer productivity when designing for such platforms. While this has enabled a wider audience to target specialized hardware, the optimization principles known from traditional software design are no longer sufficient to implement high-performance codes, and fast, efficient codes for reconfigurable platforms remain challenging to design. To alleviate this, we present a set of optimizing transformations for HLS, targeting scalable and efficient architectures for high-performance computing (HPC) applications. Our work provides a toolbox for developers, in which we systematically identify classes of transformations, the characteristics of their effect on the HLS code and the resulting hardware (e.g., increased data reuse or resource consumption), and the objectives each transformation can target (e.g., resolving interface contention or increasing parallelism). We show how these can be used to efficiently exploit pipelining, on-chip distributed fast memory, and on-chip streaming dataflow, allowing for massively parallel architectures. To quantify the effect of our transformations, we use them to optimize a set of throughput-oriented FPGA kernels, demonstrating that they are sufficient to scale up parallelism within the hardware constraints. With the transformations covered, we hope to establish a common framework for performance engineers, compiler developers, and hardware developers to tap into the performance potential offered by specialized hardware architectures using HLS.
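    One class of transformation the paper targets can be shown on a toy reduction: a loop-carried dependence on a floating-point accumulator limits pipelining, so the accumulation is interleaved across several partial sums that are combined afterwards. The sketch below is a generic illustration (pragma spelling follows Xilinx/Vitis HLS; the interleaving depth K is an assumption), not code from the paper.

```cpp
// Breaking a loop-carried reduction dependence for HLS pipelining.
#include <cstdio>

constexpr int K = 8;  // assumed >= latency of the floating-point adder

float sum_interleaved(const float *x, int n) {
    float partial[K] = {0.0f};
    for (int i = 0; i < n; ++i) {
        #pragma HLS PIPELINE II=1
        // consecutive iterations write different partial sums, so no
        // iteration depends on the previous iteration's result
        partial[i % K] += x[i];
    }
    float total = 0.0f;
    for (int k = 0; k < K; ++k) total += partial[k];  // final combine
    return total;
}

int main() {
    float x[32];
    for (int i = 0; i < 32; ++i) x[i] = 1.0f;
    std::printf("%.1f\n", sum_interleaved(x, 32));  // prints 32.0
}
```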

    Structured execution of fine-grained OpenMP applications on multicore architectures

    Get PDF
    Contemporary multiprocessor architectures, which naturally reflect the current evolution of microprocessors toward massively multicore chips, exhibit increasingly hierarchical parallelism. To approach the theoretical performance of these machines, one must now extract increasingly fine-grained parallelism from applications, but above all communicate its structure, and if possible scheduling directives, to the underlying runtime system. In this article, we explain why OpenMP is an excellent vehicle for extracting massive, structured, annotated parallelism from applications, and we show how, through an extension of the GNU OpenMP compiler relying on a NUMA-aware thread scheduler, dynamic and irregular applications can be executed efficiently while preserving thread and data affinity.
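    In miniature, the structure the article forwards to its NUMA-aware scheduler looks like the nested OpenMP regions below; here the thread counts are fixed by hand with num_threads clauses, standing in for the decisions the extended runtime would make.

```cpp
// Nested OpenMP parallel regions: an outer team (e.g. one thread per
// NUMA node) each spawning an inner team (e.g. the cores of that node).
#include <cstdio>
#include <omp.h>

int main() {
    omp_set_max_active_levels(2);        // allow one level of nesting
    #pragma omp parallel num_threads(2)  // outer team (assumed size)
    {
        int outer = omp_get_thread_num();
        #pragma omp parallel num_threads(4)  // inner team (assumed size)
        {
            #pragma omp critical
            std::printf("outer %d, inner %d\n", outer, omp_get_thread_num());
        }
    }
}
```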

    Parallelization Strategies for Modern Computing Platforms: Application to Illustrative Image Processing and Computer Vision Applications

    Get PDF
    With improvements in both hardware and software technology, parallel platforms have started a new computing era, and they are here to stay. Parallel platforms are found in high-performance computing (HPC) systems and in embedded computers. Recently, both domains have moved toward heterogeneous computing platforms, employing both central processing units (CPUs) and graphics processing units (GPUs) to achieve the highest performance. Programming efficiently for parallel platforms brings new opportunities but also several major challenges, equally present whether one programs a many-core GPU or a multi-core CPU; industry therefore needs help from the research community to succeed in this dramatic shift to parallel computing. Three of the main challenges are: (1) selecting the platform that provides the required acceleration, (2) selecting the best parallelization strategy, and (3) tuning performance to efficiently exploit the chosen platform. In this context, the overall objective of our research is to propose new solutions that help designers efficiently program complex applications on modern parallel architectures. The contributions of this thesis are: 1. An evaluation of the efficiency of several target parallel platforms at speeding up compute-intensive applications. 2. A quantitative analysis of parallelization and implementation strategies on multi-core CPUs and many-core GPUs. 3. The definition and implementation of a new performance tuning framework for heterogeneous parallel platforms. The contributions were validated using real compute-intensive applications and modern parallel platforms based on multi-core CPUs and many-core GPUs.
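    As an illustration of the kind of CPU-side parallelization strategy the thesis quantifies for image processing workloads, the sketch below parallelizes a 3x3 convolution over image rows with OpenMP; the image size and kernel are placeholders, not taken from the thesis.

```cpp
// 3x3 image convolution parallelized over rows with OpenMP.
#include <cstdio>
#include <vector>

int main() {
    const int W = 1024, H = 1024;
    std::vector<float> in(W * H, 1.0f), out(W * H, 0.0f);
    const float k[3][3] = {{0, -1, 0}, {-1, 5, -1}, {0, -1, 0}};  // sharpen

    #pragma omp parallel for schedule(static)
    for (int y = 1; y < H - 1; ++y)          // rows are independent
        for (int x = 1; x < W - 1; ++x) {
            float acc = 0.0f;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    acc += k[dy + 1][dx + 1] * in[(y + dy) * W + (x + dx)];
            out[y * W + x] = acc;
        }

    std::printf("out[512,512] = %.1f\n", out[512 * W + 512]);  // 1.0 for flat input
}
```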

    Development of a Parallel Computational Framework to Solve Flow and Transport in Integrated Surface-Subsurface Hydrologic Systems

    Get PDF
    HydroGeoSphere (HGS) is a 3D control-volume finite element hydrologic model describing fully integrated surface-subsurface water flow and solute and thermal energy transport. Because the model solves tightly coupled, highly nonlinear partial differential equations, often applied at regional and continental scales (for example, to analyze the impact of climate change on water resources), high-performance computing (HPC) is essential. The parallelization targets the assembly of the Jacobian matrix for the iterative linearization method and the sparse-matrix solver, a preconditioned BiCGSTAB. The Jacobian assembly is parallelized using a static scheduling scheme that takes into account the data race conditions that may occur during matrix construction. The solver is parallelized by partitioning the domain into equal-size sub-domains with an efficient reordering scheme. The computational flow of the BiCGSTAB solver is also modified to reduce parallelization overhead and to suit parallel architectures. The parallelized model is tested on several benchmark cases that include linear and nonlinear problems involving various domain sizes and degrees of hydrologic complexity. Performance is evaluated in terms of computational robustness and efficiency, using standard scaling measures. Profiling results indicate that efficiency improves in three situations: 1) with an increasing number of nodes/elements in the mesh, because the relative portion of parallel overhead in total computing time decreases as the problem size grows; 2) for increasingly nonlinear transient simulations, because these make the coefficient matrix more diagonally dominant; and 3) with domains of irregular geometry, which increase the condition number. These characteristics are promising for the large-scale analysis of water resource problems that involve integrated surface-subsurface flow regimes. Large-scale real-world simulations illustrate the importance of node reordering, which is associated with the domain partitioning process. With node reordering, superlinear parallel speedup was obtained compared with a serial simulation performed with natural node ordering. The results also indicate that the number of iterations increases with the number of threads, due to the increased number of elements in the off-diagonal blocks of the coefficient matrix. Finally, the parallel efficiency with the privatization scheme was higher than with the shared scheme for most of the simulations performed.
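    The BiCGSTAB iterations described above are dominated by sparse matrix-vector products, and row-wise static scheduling of the kind used for the Jacobian applies to them directly; a minimal CSR sketch (not HGS code) is shown below. Distributing rows in fixed contiguous chunks means each thread owns its output entries, so no race handling is needed.

```cpp
// CSR sparse matrix-vector product with static OpenMP scheduling.
#include <cstdio>
#include <vector>

void spmv_csr(int n, const std::vector<int> &rowptr,
              const std::vector<int> &col, const std::vector<double> &val,
              const std::vector<double> &x, std::vector<double> &y) {
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i) {
        double acc = 0.0;
        for (int j = rowptr[i]; j < rowptr[i + 1]; ++j)
            acc += val[j] * x[col[j]];
        y[i] = acc;  // each row is owned by exactly one thread
    }
}

int main() {
    // 3x3 tridiagonal test matrix [2 -1 0; -1 2 -1; 0 -1 2]
    std::vector<int> rowptr = {0, 2, 5, 7};
    std::vector<int> col = {0, 1, 0, 1, 2, 1, 2};
    std::vector<double> val = {2, -1, -1, 2, -1, -1, 2};
    std::vector<double> x = {1, 1, 1}, y(3);
    spmv_csr(3, rowptr, col, val, x, y);
    std::printf("y = [%.0f %.0f %.0f]\n", y[0], y[1], y[2]);  // [1 0 1]
}
```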