10 research outputs found
Scalable and Reliable Sparse Data Computation on Emergent High Performance Computing Systems
Heterogeneous systems with both CPUs and GPUs have become important system architectures in emergent High Performance Computing (HPC) systems. Heterogeneous systems must address both performance-scalability and power-scalability in the presence of failures. Aggressive power reduction pushes hardware to its operating limit and increases the failure rate. Resilience allows programs to progress when subjected to faults and is an integral component of large-scale systems, but incurs significant time and energy overhead. The future exascale systems are expected to have higher power consumption with higher fault rates. Sparse data computation is the fundamental kernel in many scientific applications. It is suitable for the studies of scalability and resilience on heterogeneous systems due to its computational characteristics.
To deliver the promised performance within the given power budget, heterogeneous computing mandates a deep understanding of the interplay between scalability and resilience. Managing scalability and resilience is challenging in heterogeneous systems, due to the heterogeneous compute capability, power consumption, and varying failure rates between CPUs and GPUs. Scalability and resilience have been traditionally studied in isolation, and optimizing one typically detrimentally impacts the other. While prior works have been proved successful in optimizing scalability and resilience on CPU-based homogeneous systems, simply extending current approaches to heterogeneous systems results in suboptimal performance-scalability and/or power-scalability.
To address the above multiple research challenges, we propose novel resilience and energy-efficiency technologies to optimize scalability and resilience for sparse data computation on heterogeneous systems with CPUs and GPUs. First, we present generalized analytical and experimental methods to analyze and quantify the time and energy costs of various recovery schemes, and develop and prototype performance optimization and power management strategies to improve scalability for sparse linear solvers. Our results quantitatively reveal that each resilience scheme has its own advantages depending on the fault rate, system size, and power budget, and the forward recovery can further benefit from our performance and power optimizations for large-scale computing. Second, we design a novel resilience technique that relaxes the requirement of synchronization and identicalness for processes, and allows them to run in heterogeneous resources with power reduction. Our results show a significant reduction in energy for unmodified programs in various fault situations compared to exact replication techniques. Third, we propose a novel distributed sparse tensor decomposition that utilizes an asynchronous RDMA-based approach with OpenSHMEM to improve scalability on large-scale systems and prove that our method works well in heterogeneous systems. Our results show our irregularity-aware workload partition and balanced-asynchronous algorithms are scalable and outperform the state-of-the-art distributed implementations. We demonstrate that understanding different bottlenecks for various types of tensors plays critical roles in improving scalability
Parallel programming systems for scalable scientific computing
High-performance computing (HPC) systems are more powerful than ever before. However, this rise in performance brings with it greater complexity, presenting significant challenges for researchers who wish to use these systems for their scientific work. This dissertation explores the development of scalable programming solutions for scientific computing. These solutions aim to be effective across a diverse range of computing platforms, from personal desktops to advanced supercomputers.To better understand HPC systems, this dissertation begins with a literature review on exascale supercomputers, massive systems capable of performing 10¹⁸ floating-point operations per second. This review combines both manual and data-driven analyses, revealing that while traditional challenges of exascale computing have largely been addressed, issues like software complexity and data volume remain. Additionally, the dissertation introduces the open-source software tool (called LitStudy) developed for this research.Next, this dissertation introduces two novel programming systems. The first system (called Rocket) is designed to scale all-versus-all algorithms to massive datasets. It features a multi-level software-based cache, a divide-and-conquer approach, hierarchical work-stealing, and asynchronous processing to maximize data reuse, exploit data locality, dynamically balance workloads, and optimize resource utilization. The second system (called Lightning) aims to scale existing single-GPU kernel functions across multiple GPUs, even on different nodes, with minimal code adjustments. Results across eight benchmarks on up to 32 GPUs show excellent scalability.The dissertation concludes by proposing a set of design principles for developing parallel programming systems for scalable scientific computing. These principles, based on lessons from this PhD research, represent significant steps forward in enabling researchers to efficiently utilize HPC systems
Resilience for Asynchronous Iterative Methods for Sparse Linear Systems
Large scale simulations are used in a variety of application areas in science and engineering to help forward the progress of innovation. Many spend the vast majority of their computational time attempting to solve large systems of linear equations; typically arising from discretizations of partial differential equations that are used to mathematically model various phenomena. The algorithms used to solve these problems are typically iterative in nature, and making efficient use of computational time on High Performance Computing (HPC) clusters involves constantly improving these iterative algorithms. Future HPC platforms are expected to encounter three main problem areas: scalability of code, reliability of hardware, and energy efficiency of the platform. The HPC resources that are expected to run the large programs are planned to consist of billions of processing units that come from more traditional multicore processors as well as a variety of different hardware accelerators. This growth in parallelism leads to the presence of all three problems.
Previously, work on algorithm development has focused primarily on creating fault tolerance mechanisms for traditional iterative solvers. Recent work has begun to revisit using asynchronous methods for solving large scale applications, and this dissertation presents research into fault tolerance for fine-grained methods that are asynchronous in nature. Classical convergence results for asynchronous methods are revisited and modified to account for the possible occurrence of a fault, and a variety of techniques for recovery from the effects of a fault are proposed. Examples of how these techniques can be used are shown for various algorithms, including an analysis of a fine-grained algorithm for computing incomplete factorizations. Lastly, numerous modeling and simulation tools for the further construction of iterative algorithms for HPC applications are developed, including numerical models for simulating faults and a simulation framework that can be used to extrapolate the performance of algorithms towards future HPC systems
High Performance Fault-Tolerant Solution of PDEs using the Sparse Grid Combination Technique
The data volume of Partial Differential Equation (PDE) based
ultra-large-scale scientific simulations is increasing at a
higher rate than that of the system’s processing power. To
process the increased amount of simulation data within a
reasonable amount of time, the evolution of computation is
expected to reach the exascale level. One of several key
challenges to overcome in these exascale systems is to handle the
high rate of component failure arising due to having millions of
cores working together with high power consumption and clock
frequencies. Studies show that even the highly tuned widely used
checkpointing technique is unable to handle the failures
efficiently in exascale systems. The Sparse Grid Combination
Technique (SGCT) is proved to be a cost-effective method for
computing high-dimensional PDE based simulations with only small
loss of accuracy, which can be easily modified to provide an
Algorithm-Based Fault Tolerance (ABFT) for these applications.
Additionally, the recently introduced User Level Failure
Mitigation (ULFM) MPI library provides the ability to detect and
identify application process failures, and reconstruct the failed
processes. However, there is a gap of the research how these
could be integrated together to develop fault-tolerant
applications, and the range of issues that may arise in the
process are yet to be revealed.
My thesis is that with suitable infrastructural support an
integration of ULFM MPI and a modified form of the SGCT can be
used to create high performance robust PDE based applications.
The key contributions of my thesis are: (1) An evaluation of the
effectiveness of applying the modified version of the SGCT on
three existing and complex applications (including a general
advection solver) to make them highly fault-tolerant. (2) An
evaluation of the capabilities of ULFM MPI to recover from a
single or multiple real process/node failures for a range of
complex applications computed with the modified form of the SGCT.
(3) A detailed experimental evaluation of the fault-tolerant work
including the time and space requirements, and parallelization on
the non-SGCT dimensions. (4) An analysis of the result errors
with respect to the number of failures. (5) An analysis of the
ABFT and recovery overheads. (6) An in-depth comparison of the
fault-tolerant SGCT based ABFT with traditional checkpointing on
a non-fault-tolerant SGCT based application. (7) A detailed
evaluation of the infrastructural support in terms of load
balancing, pure- and hybrid-MPI, process layouts, processor
affinity, and so on
Software for Exascale Computing - SPPEXA 2016-2019
This open access book summarizes the research done and results obtained in the second funding phase of the Priority Program 1648 "Software for Exascale Computing" (SPPEXA) of the German Research Foundation (DFG) presented at the SPPEXA Symposium in Dresden during October 21-23, 2019. In that respect, it both represents a continuation of Vol. 113 in Springer’s series Lecture Notes in Computational Science and Engineering, the corresponding report of SPPEXA’s first funding phase, and provides an overview of SPPEXA’s contributions towards exascale computing in today's sumpercomputer technology. The individual chapters address one or more of the research directions (1) computational algorithms, (2) system software, (3) application software, (4) data management and exploration, (5) programming, and (6) software tools. The book has an interdisciplinary appeal: scholars from computational sub-fields in computer science, mathematics, physics, or engineering will find it of particular interest