Simulating Radiating and Magnetized Flows in Multi-Dimensions with ZEUS-MP
This paper describes ZEUS-MP, a multi-physics, massively parallel, message-
passing implementation of the ZEUS code. ZEUS-MP differs significantly from the
ZEUS-2D code, the ZEUS-3D code, and an early "version 1" of ZEUS-MP distributed
publicly in 1999. ZEUS-MP offers an MHD algorithm better suited for
multidimensional flows than the ZEUS-2D module by virtue of modifications to
the Method of Characteristics scheme first suggested by Hawley and Stone
(1995), and is shown to compare quite favorably to the TVD scheme described by
Ryu et al. (1998). ZEUS-MP is the first publicly available ZEUS code to allow
the advection of multiple chemical (or nuclear) species. Radiation hydrodynamic
simulations are enabled via an implicit flux-limited radiation diffusion (FLD)
module. The hydrodynamic, MHD, and FLD modules may be used in one, two, or
three space dimensions. Self-gravity may be included either through the
assumption of a GM/r potential or a solution of Poisson's equation using one of
three linear solver packages (conjugate-gradient, multigrid, and FFT) provided
for that purpose. Point-mass potentials are also supported. Because ZEUS-MP is
designed for simulations on parallel computing platforms, considerable
attention is paid to the parallel performance characteristics of each module.
Strong-scaling tests involving pure hydrodynamics (with and without
self-gravity), MHD, and RHD are performed in which large problems (256^3 zones)
are distributed among as many as 1024 processors of an IBM SP3. Parallel
efficiency is a strong function of the amount of communication required between
processors in a given algorithm, but all modules are shown to scale well on up
to 1024 processors for the chosen fixed problem size.
Comment: Accepted for publication in the ApJ Supplement. 42 pages with 29
inlined figures; uses emulateapj.sty. Discussions in sections 2 - 4 improved
per referee comments; several figures modified to illustrate grid resolution.
ZEUS-MP source code and documentation available from the Laboratory for
Computational Astrophysics at http://lca.ucsd.edu/codes/currentcodes/zeusmp2
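Of the three Poisson solvers mentioned in the abstract, conjugate-gradient is the simplest to sketch. The following is a minimal matrix-free illustration on a 1D model problem, not the ZEUS-MP implementation; all names are invented, and ZEUS-MP's solvers operate on 3D grids in parallel.

```python
# Minimal conjugate-gradient solve of a 1D discrete Poisson equation,
# -u'' = f on a unit interval with zero boundary values, using a
# matrix-free stencil application instead of an explicit matrix.

def poisson_matvec(u, h):
    """Apply the 1D Laplacian stencil (-u[i-1] + 2u[i] - u[i+1]) / h^2."""
    n = len(u)
    out = [0.0] * n
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        out[i] = (-left + 2.0 * u[i] - right) / h**2
    return out

def conjugate_gradient(f, h, tol=1e-12, max_iter=500):
    """Iterate until the squared residual norm drops below tol."""
    n = len(f)
    x = [0.0] * n
    r = f[:]                      # residual b - A x (x = 0 initially)
    p = r[:]
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        ap = poisson_matvec(p, h)
        alpha = rs_old / sum(pi * api for pi, api in zip(p, ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:
            break
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x
```

With a constant source f = 1 on 63 interior points, the discrete solution matches the analytic u(x) = x(1 - x)/2 exactly, since the 3-point stencil is exact for quadratics.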
Doctor of Philosophy dissertation
Recent trends in high performance computing present larger and more diverse computers using multicore nodes, possibly with accelerators and/or coprocessors, and reduced memory. These changes pose formidable challenges for application codes to attain scalability. Software frameworks that execute machine-independent application code using a runtime system that shields users from architectural complexities offer a portable solution for easy programming. The Uintah framework, for example, solves a broad class of large-scale problems on structured adaptive grids using fluid-flow solvers coupled with particle-based solids methods. However, the original Uintah code had limited scalability, as tasks were run in a predefined order based solely on static analysis of the task graph, and used only the Message Passing Interface (MPI) for parallelism. By using a new hybrid multithreaded and MPI runtime system, this research has made it possible for Uintah to scale to 700K central processing unit (CPU) cores when solving challenging fluid-structure interaction problems. Such problems often involve moving objects with adaptive mesh refinement and thus highly variable and unpredictable work patterns. This research has also demonstrated the ability to run capability jobs on heterogeneous systems with Nvidia graphics processing unit (GPU) accelerators or Intel Xeon Phi coprocessors. The new runtime system for Uintah executes directed acyclic graphs of computational tasks with a scalable, asynchronous, and dynamic runtime system for multicore CPUs and/or accelerators/coprocessors on a node. Uintah's clear separation between application and runtime code has led to scalability increases without significant changes to application code. This research concludes that the adaptive directed acyclic graph (DAG)-based approach provides a very powerful abstraction for solving challenging multiscale multiphysics engineering problems.
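The DAG-based task execution at the heart of this approach can be illustrated with a toy topological scheduler (Kahn's algorithm). All names here are invented for illustration; Uintah's actual runtime additionally overlaps MPI communication, drives threads, and offloads to accelerators.

```python
# Toy illustration of DAG-based task execution: a task becomes runnable
# once all of its dependencies have completed.
from collections import deque

def execute_dag(tasks, deps):
    """tasks: {name: callable}; deps: {name: set of prerequisite names}.
    Runs each task after its prerequisites and returns the execution order."""
    remaining = {t: set(deps.get(t, ())) for t in tasks}
    dependents = {t: [] for t in tasks}
    for t, pre in remaining.items():
        for p in pre:
            dependents[p].append(t)
    ready = deque(t for t, pre in remaining.items() if not pre)
    order = []
    while ready:
        t = ready.popleft()
        tasks[t]()                  # run the task body
        order.append(t)
        for d in dependents[t]:
            remaining[d].discard(t)
            if not remaining[d]:
                ready.append(d)
    if len(order) != len(tasks):
        raise ValueError("cycle detected in task graph")
    return order
```

A dynamic runtime generalizes this by letting "ready" tasks run on whichever core or device frees up first, rather than in a predefined order.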
Excellent scalability across different processor counts and excellent communication performance are achieved on some of the largest and most powerful computers available today
Solving Hyperbolic PDEs using Accelerator Architectures
Accelerator architectures are used to accelerate the
simulation of nonlinear hyperbolic PDEs. Three different architectures, a multicore
CPU using threading, IBM’s Cell processor, and Nvidia’s Tesla GPUs, are investigated. Speed-ups of 40-75× relative to a single CPU core in single precision are obtained using the Cell processor and the GPU. The three implementations are extended to parallel computing clusters by making use
of the Message Passing Interface (MPI). The resulting hybrid-parallel code is investigated
for performance and scalability on both a GPU and Cell computing cluster
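The kind of kernel being accelerated can be illustrated with a first-order upwind step for linear advection: every cell update is independent of the others, which is why such stencils map naturally onto CPU threads, Cell SPEs, or GPU threads. This is a hedged sketch of the general technique, not the paper's specific scheme.

```python
# One first-order upwind update for the linear advection equation
# u_t + a u_x = 0 (a > 0), with periodic boundaries.

def upwind_step(u, a, dt, dx):
    n = len(u)
    c = a * dt / dx               # CFL number; stable for 0 <= c <= 1
    # Each cell depends only on itself and its left neighbor, so all
    # n updates can execute in parallel.
    return [u[i] - c * (u[i] - u[(i - 1) % n]) for i in range(n)]
```

At c = 1 the scheme shifts the profile exactly one cell per step, a handy sanity check for any parallel implementation.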
Intelligent instrumentation techniques to improve the traces information-volume ratio
With ever more powerful machines being constantly deployed, it is crucial to manage the computational resources efficiently. This is important both from the point of view of the individual user, who expects fast results; and the supercomputing center hosting the whole infrastructure, that is interested in maximizing its overall productivity. Nevertheless, the real sustained performance achieved by the applications can be significantly lower than the theoretical peak performance of the machines. A key factor to bridge this performance gap is to understand how parallel computers behave.
Performance analysis tools are essential not only to understand the behavior of parallel applications, but also to identify why performance expectations might not have been met, serving as guidelines to improve the inefficiencies that caused poor performance, and driving both software and hardware optimizations. However, detailed analysis of the behavior of a parallel application requires processing a large amount of data that also grows extremely fast.
Current large scale systems already comprise hundreds of thousands of cores, and upcoming exascale systems are expected to assemble more than a million processing elements. With such a number of hardware components, the traditional analysis methodologies, consisting of blindly collecting as much data as possible and then performing exhaustive lookups, are no longer applicable, because the volume of performance data generated becomes absolutely unmanageable to store, process and analyze. The evolution of the tools suggests that more complex approaches are needed, incorporating intelligence to competently perform the challenging and important task of detailed analysis.
In this thesis, we address the problem of scalability of performance analysis tools in large scale systems. In such scenarios, in-depth understanding of the interactions between all the system components is more compelling than ever for an effective use of the parallel resources. To this end, our work includes a thorough review of techniques that have been successfully applied to aid in the task of Big Data analytics in fields like machine learning, data mining, signal processing and computer vision. We have leveraged these techniques to improve the analysis of large-scale parallel applications by automatically uncovering repetitive patterns, finding data correlations, detecting performance trends and extracting further useful analysis information. Combining their use, we have minimized the volume of performance data captured from an execution, while maximizing the benefit and insight gained from this data, and have proposed new and more effective methodologies for single- and multi-experiment performance analysis.
Coupled Kinetic-Fluid Simulations of Ganymede's Magnetosphere and Hybrid Parallelization of the Magnetohydrodynamics Model
The largest moon in the solar system, Ganymede, is the only moon known to possess a strong intrinsic magnetic field.
The interaction between the Jovian plasma and Ganymede's magnetic field creates a mini-magnetosphere with periodically varying upstream conditions, which creates a perfect laboratory in nature for studying magnetic reconnection and magnetospheric physics.
Using the latest version of the Space Weather Modeling Framework (SWMF), we study the upstream plasma interactions and dynamics in this subsonic, sub-Alfvénic system.
We have developed a coupled fluid-kinetic Hall Magnetohydrodynamics with embedded Particle-in-Cell (MHD-EPIC) model for Ganymede's magnetosphere, with a self-consistently coupled resistive body representing the electrical properties of the moon's interior, improved inner boundary conditions, and a high-resolution, charge- and energy-conserving PIC scheme.
I reimplemented the boundary condition setup in SWMF for more versatile control and functionalities, and developed a new user module for Ganymede's simulation.
Results from the models are validated with Galileo magnetometer data of all close encounters and compared with Plasma Subsystem (PLS) data.
The energy flux associated with the upstream reconnection in the model is estimated to be about 10^-7 W/cm^2, which accounts for about 40% of the total peak auroral emissions observed by the Hubble Space Telescope.
We find that under steady upstream conditions, magnetopause reconnection in our fluid-kinetic simulations occurs in a non-steady manner.
Flux ropes with lengths comparable to Ganymede's radius form on the magnetopause at a rate of about 3 per minute and create spatiotemporal variations in plasma and field properties.
Upon reaching proper grid resolution, the MHD-EPIC model can resolve both electron and ion kinetics at the magnetopause, showing localized crescent-shaped distributions in both ion and electron phase space and non-gyrotropic, non-isotropic behavior inside the diffusion regions.
The estimated global reconnection rate from the models is about 80 kV with 60% efficiency.
There is weak evidence of minute periodicity in the temporal variations of the reconnection rate due to the dynamic reconnection process.
The need for high-fidelity results motivates the development of a hybrid-parallel numerical modeling strategy and faster data processing techniques.
The state-of-the-art finite-volume/finite-difference MHD code Block Adaptive Tree Solarwind Roe Upwind Scheme (BATS-R-US) was originally designed with pure MPI parallelization.
The maximum problem size achievable was limited by the storage requirements of the block tree structure.
To mitigate this limitation, we have added multithreaded OpenMP parallelization to the previous pure MPI implementation.
We opt to use a coarse-grained approach by making the loops over grid blocks multithreaded and have succeeded in making BATS-R-US an efficient hybrid parallel code with modest changes in the source code while preserving the performance.
Good weak scaling up to 500,000 cores for the explicit time stepping scheme and 250,000 cores for the implicit scheme is achieved.
This parallelization strategy greatly extends the possible simulation scale by an order of magnitude, and paves the way for future GPU-portable code development.
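The coarse-grained strategy, placing the parallel loop around whole grid blocks rather than around inner cell loops, can be sketched structurally as follows. Python threads stand in for OpenMP threads purely to show the decomposition; the names are invented and the real code applies OpenMP directives inside each MPI rank.

```python
# Structural sketch of coarse-grained parallelism over grid blocks:
# each worker advances an entire block, so thread-private state and
# synchronization stay at block granularity.
from concurrent.futures import ThreadPoolExecutor

def advance_block(block):
    """Stand-in for one explicit time step on a single grid block."""
    return [cell + 1.0 for cell in block]

def advance_all_blocks(blocks, num_threads=4):
    # The loop over blocks, not the loop over cells, is parallelized.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(advance_block, blocks))
```

The appeal of this granularity is that only the block loop needs annotating, which is why the hybrid port required modest source changes.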
To improve visualization and data processing, I have developed a whole new data processing workflow with the Julia programming language for efficient data analysis and visualization.
As a summary:
1. I built a single-fluid Hall MHD-EPIC model of Ganymede's magnetosphere;
2. I performed a detailed analysis of the upstream reconnection;
3. I developed an MPI+OpenMP parallel MHD model with BATS-R-US;
4. I wrote a package for data analysis and visualization.
PHD
Climate and Space Sciences and Engineering
University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/163032/1/hyzhou_1.pd
Keeping checkpoint/restart viable for exascale systems
Next-generation exascale systems, those capable of performing a quintillion operations per second, are expected to be delivered in the next 8-10 years. These systems, which will be 1,000 times faster than current systems, will be of unprecedented scale. As these systems continue to grow in size, faults will become increasingly common, even over the course of small calculations. Therefore, issues such as fault tolerance and reliability will limit application scalability. Current techniques to ensure progress across faults, like checkpoint/restart, the dominant fault tolerance mechanism for the last 25 years, are increasingly problematic at the scales of future systems due to their excessive overheads. In this work, we evaluate a number of techniques to decrease the overhead of checkpoint/restart and keep this method viable for future exascale systems. More specifically, this work evaluates state-machine replication to dramatically increase the checkpoint interval (the time between successive checkpoints) and hash-based, probabilistic incremental checkpointing using graphics processing units to decrease the checkpoint commit time (the time to save one checkpoint). Using a combination of empirical analysis, modeling, and simulation, we study the costs and benefits of these approaches on a wide range of parameters. These results, which cover a number of high-performance computing capability workloads, different failure distributions, hardware mean times to failure, and I/O bandwidths, show the potential benefits of these techniques for meeting the reliability demands of future exascale platforms
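The bookkeeping behind hash-based incremental checkpointing can be sketched as follows: split application state into fixed-size blocks, hash each block, and write only the blocks whose hash changed since the last checkpoint. This is a minimal pure-Python illustration with an invented block size; the cited work computes the hashes on GPUs and treats the approach probabilistically.

```python
# Sketch of hash-based incremental checkpointing.
import hashlib

BLOCK_SIZE = 4096   # illustrative block size, not from the paper

def incremental_checkpoint(state: bytes, prev_hashes):
    """Return (dirty_blocks, new_hashes); dirty_blocks maps block index
    to the bytes that must be written to stable storage."""
    new_hashes = {}
    dirty = {}
    for i in range(0, len(state), BLOCK_SIZE):
        idx = i // BLOCK_SIZE
        block = state[i:i + BLOCK_SIZE]
        h = hashlib.sha256(block).digest()
        new_hashes[idx] = h
        # A block is written only if its hash differs from last time.
        if prev_hashes.get(idx) != h:
            dirty[idx] = block
    return dirty, new_hashes
```

The commit time then scales with the number of modified blocks rather than with total memory footprint, which is the source of the overhead reduction.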
An Object-Oriented, Python-Based Moving Mesh Hydrodynamics Code Inspired by Astrophysical Problems
Radiative cooling plays an important role in the formation of structures in collapsing gas. In this dissertation I examine the impact of cooling in two formation scenarios: first, the role of H2 cooling in collapsing gas in primordial dark matter halos in the possible formation of supermassive black holes; second, low-metallicity cooling in collapsing clouds and its possible role in explaining low-metallicity globular clusters. Further, I introduce a new hydrodynamics code, with a design guided by current software principles. In chapter 2, I examine the proposed mechanism to explain the formation of supermassive black holes through direct collapse. The presence of quasars at redshifts z > 6 indicates the existence of supermassive black holes (SMBHs) as massive as a few times 10^9 solar masses, challenging models for SMBH formation. One pathway is through the direct collapse of gas in T_vir ≳ 10^4 K halos; however, this requires the suppression of H2 cooling to prevent fragmentation. In this dissertation, I examine a proposed mechanism for this suppression which relies on cold-mode accretion flows leading to shocks at high densities (n > 10^4 cm^−3) and temperatures (T > 10^4 K). In such gas, H2 is efficiently collisionally dissociated. I use high-resolution numerical simulations to test this idea, demonstrating that such halos typically have lower-temperature progenitors, in which cooling is efficient. Those halos do show filamentary flows; however, the gas shocks at or near the virial radius (at low densities), thus preventing the proposed collisional mechanism from operating. I do find that, if we artificially suppress H2 formation with a high UV background, so as to allow gas in the halo center to enter the high-temperature, high-density “zone of no return”, it will remain there even if the UV flux is turned off, collapsing to high density at high temperature. Due to computational limitations, we simulated only three halos.
However, we demonstrate, using Monte Carlo calculations of 10^6 halo merger histories, that a few rare halos could assemble rapidly enough to avoid efficient H2 cooling in all of their progenitor halos, provided that the UV background exceeds J_21 ∼ few at redshifts as high as z ∼ 20. In chapter 3, I explore the relative role of small-scale fragmentation and global collapse in low-metallicity clouds, pointing out that in such clouds the cooling time may be longer than the dynamical time, allowing the cloud to collapse globally before it can fragment. This, I suggest, may help to explain the formation of the low-metallicity globular cluster population, since such dense stellar systems need a large amount of gas to be collected in a small region (without significant feedback during the collapse). To explore this further, I carried out numerical simulations of low-metallicity Bonnor-Ebert stable gas clouds, demonstrating that there exists a critical metallicity (between 0.001 and 0.01 solar metallicity) below which the cloud collapses globally without fragmentation. I also ran simulations including a background radiative heating source, showing that this can also produce clouds that do not fragment, and that the critical metallicity – which can exceed the no-radiation case – increases with the heating rate. Lastly, in chapter 4, I describe the structure and implementation of the new open-source parallel moving-mesh hydrodynamics code, Python Hydro-Dynamics (phd). The code has been written from the ground up to be easy to use and to facilitate future modifications. The code is written in a mixture of Python and Cython and makes extensive use of object-oriented programming. I outline the algorithms used and describe the design philosophy and the reasoning behind my choices during the code development. I end by validating the code through a series of test problems
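The global-collapse argument of chapter 3 rests on comparing two timescales: if the cooling time exceeds the free-fall (dynamical) time, the cloud collapses as a whole before fragments can cool and condense. A minimal sketch of this criterion, using the standard free-fall formula t_ff = sqrt(3*pi/(32*G*rho)); the function names and example values are illustrative, not from the dissertation.

```python
# Compare a cloud's cooling time with its free-fall time (cgs units).
import math

G = 6.674e-8   # gravitational constant, cm^3 g^-1 s^-2

def free_fall_time(rho):
    """Free-fall time in seconds for mass density rho in g/cm^3."""
    return math.sqrt(3.0 * math.pi / (32.0 * G * rho))

def collapses_globally(t_cool, rho):
    """True when cooling is too slow for fragmentation (t_cool > t_ff)."""
    return t_cool > free_fall_time(rho)
```

For a diffuse cloud with rho ~ 1e-22 g/cm^3, t_ff is of order 10^14 s (a few Myr), so only clouds whose metal-line cooling is slower than this would satisfy the global-collapse condition.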
Application of clustering analysis and sequence analysis on the performance analysis of parallel applications
High Performance Computing and Supercomputing is the high-end area of computing science that studies and develops the most powerful computers available. Current supercomputers are extremely complex, and so are the applications that run on them. To take advantage
of the huge amount of computing power available, it is strictly necessary to maximize the knowledge we have about how these applications behave and perform. This is the mission of (parallel) performance analysis.
In general, performance analysis toolkits offer only very simplistic manipulations of the performance data. First-order statistics such as the average or standard deviation are used to summarize the values of a given performance metric, hiding in some cases interesting facts available in the raw performance data. For this reason, we require Performance Analytics, i.e. the application of Data Analytics techniques in the performance analysis area. This thesis contributes two new techniques to the Performance Analytics field.
The first contribution is the application of cluster analysis to detect the parallel application computation structure. Cluster analysis is the unsupervised classification of patterns (observations, data items or feature vectors) into groups (clusters). In this thesis we use
cluster analysis to group the CPU bursts of a parallel application, the regions of each process in between communication calls or calls to the parallel runtime. The resulting clusters are the different computational trends or phases that appear in the application. These clusters are useful to understand the behaviour of the computation part of the application and to focus the analyses on those parts that present performance issues. We demonstrate that our approach requires different clustering algorithms than those previously used in the area.
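The grouping of CPU bursts can be sketched with a minimal density-based clustering pass in the spirit of DBSCAN, where each burst is a point in a metric space (e.g. instructions completed vs. IPC) and dense regions become computation phases. The data, metrics and parameters below are invented for illustration; the thesis applies this idea to real hardware-counter measurements.

```python
# Minimal density-based clustering of 2D points. Points in dense
# regions are grouped; sparse points are labeled -1 (noise).
import math

def dbscan(points, eps, min_pts):
    n = len(points)
    labels = [None] * n
    def neighbors(i):
        return [j for j in range(n)
                if math.dist(points[i], points[j]) <= eps]
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1          # noise (may become a border point later)
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # reclassify noise as border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = neighbors(j)
            if len(jn) >= min_pts:   # core point: expand the cluster
                queue.extend(jn)
    return labels
```

Density-based methods need no preset cluster count and tolerate noise bursts, which is what makes them attractive for discovering an unknown number of computation phases.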
The second contribution of the thesis is the application of multiple sequence alignment algorithms to evaluate the computation structure detected. Multiple sequence alignment (MSA) is a technique commonly used in bioinformatics to determine the similarities across two or more biological sequences: DNA or proteins. The Cluster Sequence Score we introduce applies an MSA algorithm to evaluate the SPMDiness of an application, i.e. how well its computation structure represents the Single Program Multiple Data (SPMD) paradigm structure. We also use this score in the Aggregative Cluster Refinement, a new clustering algorithm we designed, able to detect the SPMD phases of an application at fine grain, surpassing the clustering algorithms we used initially. We demonstrate the usefulness of these techniques with three practical uses. The first one is an extrapolation methodology able to maximize the performance metrics that characterize the application phases detected using a single application execution. The second one is the use of the computation structure detected to speed up a multi-level simulation infrastructure. Finally, we analyse four production-class applications, using the computation
characterization to study the impact of possible application improvements and portings of the applications to different hardware configurations.
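The scoring idea can be sketched with pairwise global alignment (Needleman-Wunsch) over per-process sequences of cluster IDs. This is a simplified stand-in for the multiple-alignment-based Cluster Sequence Score, with invented scoring values: a perfectly SPMD region aligns with no gaps and all matches.

```python
# Pairwise global alignment of two sequences of cluster IDs, and a
# toy "SPMDiness" score built from averaged pairwise alignments.
MATCH, MISMATCH, GAP = 1, -1, -1   # illustrative scoring values

def alignment_score(a, b):
    """Best global-alignment (Needleman-Wunsch) score between a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * GAP
    for j in range(1, cols):
        score[0][j] = j * GAP
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i-1][j-1] + (MATCH if a[i-1] == b[j-1] else MISMATCH)
            score[i][j] = max(diag, score[i-1][j] + GAP, score[i][j-1] + GAP)
    return score[rows-1][cols-1]

def spmdiness(sequences):
    """Average pairwise score normalized by length: 1.0 means every
    process executed an identical sequence of computation phases."""
    pairs = [(s, t) for i, s in enumerate(sequences)
                    for t in sequences[i+1:]]
    total = sum(alignment_score(s, t) for s, t in pairs)
    norm = sum(max(len(s), len(t)) for s, t in pairs)
    return total / norm
```

Scores below 1.0 flag processes whose phase sequences diverge, i.e. regions where the SPMD structure breaks down.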
In summary, this thesis proposes the use of cluster analysis and sequence analysis to automatically detect and characterize the different computation trends of a parallel application. These techniques provide the developer/analyst useful insight into the application's performance
and ease the understanding of the application's behaviour. The contributions of the thesis are not limited to proposals and publications of the techniques themselves, but also include practical uses that demonstrate their usefulness in the analysis task. In addition, the research carried out during these years has produced a production tool for analysing applications' structure, part of the BSC Tools suite