Doctor of Philosophy dissertation

Current scaling trends in transistor technology, in pursuit of larger component counts and improved power efficiency, are making hardware increasingly less reliable. Extreme transistor miniaturization makes it easier to flip a bit stored in memory elements built from these transistors. Soft errors, transient bit-flips caused by alpha particles and cosmic rays striking those elements, have therefore become one of the major impediments to system resilience as we move toward exascale computing. A soft error that escapes the hardware layer may silently corrupt a program's runtime data, causing silent data corruption in the output. Moreover, because soft errors are transient in nature, it is notoriously hard to trace their origins. Techniques to enhance system resilience therefore hinge on the availability of efficient error detectors with high detection rates, low false-positive rates, and low computational overhead. It is equally important to have a flexible infrastructure capable of simulating realistic soft error models, to enable effective evaluation of newly developed error detectors. In this work, we present a set of techniques for efficiently detecting soft errors affecting control flow, data, and structured address computations in an application. We evaluate the efficacy of the proposed techniques on a collection of benchmarks through fault-injection-driven studies. As an important prerequisite, we also introduce two new LLVM-based fault injectors, KULFI and VULFI, geared toward scalar and vector architectures, respectively. Through this work, we aim to contribute to the system resilience community by making our research tools (error detectors and fault injectors) publicly available.
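The single-bit memory corruption such injectors model can be sketched in a few lines. The following Python is an illustrative stand-in only, not the actual API of KULFI or VULFI (the function names and the choice of 64-bit floats as the target are assumptions for the sketch):

```python
import random
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit of a 64-bit float, mimicking a soft error in memory."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    (corrupted,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return corrupted

def inject_fault(data: list[float], rng: random.Random) -> tuple[int, int]:
    """Corrupt one random bit of one random element, as an injector might
    do to a program's runtime data; returns the injection site."""
    idx = rng.randrange(len(data))
    bit = rng.randrange(64)
    data[idx] = flip_bit(data[idx], bit)
    return idx, bit
```

Note that a flip is its own inverse (`flip_bit(flip_bit(x, b), b) == x`), which makes injection campaigns cheap to rewind.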
High-fidelity error injection and acceleration techniques
As technology scales down, the likelihood of hardware errors that silently corrupt the results of applications is increasing. Evaluating the resilience of applications against hardware errors is thus of significant concern. Current evaluation techniques based on error injection are either low-fidelity or inefficient in their use of computing resources. This dissertation demonstrates that sophisticated integration of injectors across abstraction layers, together with novel sampling algorithms, can significantly improve both fidelity and efficiency. Specifically, the dissertation describes an open-source instruction-level error injector that generates high-fidelity hardware errors due to particle strikes and voltage droops. Two acceleration techniques, nested Monte Carlo and Injection-Point Overprovisioning, are proposed to speed up error injection campaigns by one to two orders of magnitude. The dissertation also answers the question of when high fidelity is needed to evaluate the impact of hardware errors on applications and the effectiveness of error detectors.

Electrical and Computer Engineering
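The baseline that such acceleration techniques speed up is a plain Monte Carlo injection campaign: inject one fault per trial, rerun the workload, and classify the outcome against a golden run. A minimal sketch, with an assumed toy workload (sum of squares) standing in for a real application:

```python
import random
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit of a 64-bit float to model a soft error."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    (out,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return out

def program(xs: list[float]) -> float:
    # Stand-in workload; a real campaign reruns the target application.
    return sum(v * v for v in xs)

def campaign(trials: int, seed: int = 0) -> dict:
    """Classify each injection as masked (output unchanged) or SDC."""
    rng = random.Random(seed)
    golden_input = [1.0, 2.0, 3.0, 4.0]
    golden = program(golden_input)
    outcomes = {"masked": 0, "sdc": 0}
    for _ in range(trials):
        faulty = list(golden_input)
        i, b = rng.randrange(len(faulty)), rng.randrange(64)
        faulty[i] = flip_bit(faulty[i], b)
        outcomes["masked" if program(faulty) == golden else "sdc"] += 1
    return outcomes
```

Even this toy shows masking: a sign-bit flip is invisible to squaring, so some injections leave the output unchanged, which is why naive campaigns waste many trials and why sampling acceleration pays off.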
Laboratory Directed Research and Development FY-10 Annual Report
The FY 2010 Laboratory Directed Research and Development (LDRD) Annual Report is a compendium of the diverse research performed to develop and ensure that the INL's technical capabilities can support future DOE missions and national research priorities. LDRD is essential to the INL: it provides a means for the laboratory to pursue novel scientific and engineering research in areas deemed too basic or risky for programmatic investments. This research enhances technical capabilities at the laboratory, providing scientific and engineering staff with opportunities for skill building and partnership development.
Scaling and Resilience in Numerical Algorithms for Exascale Computing
The first Petascale supercomputer, the IBM Roadrunner, went online in 2008. Ten years later, the community is looking ahead to a new generation of Exascale machines. During the intervening decade, several hundred Petascale-capable machines have been installed worldwide, yet despite the abundance of machines, applications that scale to their full size remain rare. Large clusters now routinely have 50,000+ cores, and some have several million. This extreme level of parallelism, which allows a theoretical compute capacity in excess of a million billion operations per second, turns out to be difficult to exploit in many applications of practical interest. Processors often spend more time waiting for synchronization, communication, and other coordinating operations to complete than actually computing. Component reliability is another challenge facing HPC developers. If even a single processor among many thousands fails, the user is forced to restart traditional applications, wasting valuable compute time. These issues collectively manifest as low parallel efficiency, wasting energy and computational resources. Future performance improvements are expected to continue to come largely from increased parallelism. One may therefore expect that the difficulties currently faced when scaling applications to Petascale machines will progressively worsen, making it difficult for scientists to harness the full potential of Exascale computing.
The thesis comprises two parts, each consisting of several chapters that discuss modifications of numerical algorithms to make them better suited for future Exascale machines. The first part considers the Parareal Parallel-in-Time integration method for the scalable numerical solution of partial differential equations. We propose a new adaptive scheduler that optimizes parallel efficiency by minimizing the time-subdomain length without making communication of time-subdomains too costly. In conjunction with an appropriate preconditioner, we demonstrate that it is possible to obtain time-parallel speedup on the nonlinear shallow water equation beyond what is possible using conventional spatial domain-decomposition techniques alone. The part concludes with the proposal of a new method for constructing Parallel-in-Time integration schemes better suited to convection-dominated problems.
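The Parareal idea behind this part can be sketched compactly: a cheap coarse propagator sweeps the time subdomains serially, while an expensive fine propagator runs on all subdomains in parallel and corrects the coarse prediction each iteration. A minimal sketch for a scalar ODE, using forward Euler for both propagators (the thesis's actual solvers, preconditioner, and scheduler are not reproduced here):

```python
def euler(f, y, t, dt, steps):
    """Forward Euler with the given step count; stands in for a solver."""
    for _ in range(steps):
        y = y + dt * f(t, y)
        t += dt
    return y

def parareal(f, y0, t0, t1, n_sub=10, iters=5, fine_steps=50):
    """Return the solution at the n_sub+1 time-subdomain boundaries."""
    dt = (t1 - t0) / n_sub
    ts = [t0 + i * dt for i in range(n_sub + 1)]
    coarse = lambda y, t: euler(f, y, t, dt, 1)                    # cheap, serial
    fine = lambda y, t: euler(f, y, t, dt / fine_steps, fine_steps)  # accurate; parallelizable
    # Initial guess from a serial coarse sweep.
    U = [y0]
    for i in range(n_sub):
        U.append(coarse(U[i], ts[i]))
    for _ in range(iters):
        F = [fine(U[i], ts[i]) for i in range(n_sub)]       # independent: run in parallel
        G_old = [coarse(U[i], ts[i]) for i in range(n_sub)]
        V = [y0]
        for i in range(n_sub):
            # Parareal update: U_{i+1} = G_new + F - G_old
            V.append(coarse(V[i], ts[i]) + F[i] - G_old[i])
        U = V
    return U
```

Each iteration's only serial work is the cheap coarse sweep, which is what makes speedup beyond spatial decomposition possible when the fine solves dominate.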
In the second part, new ways of mitigating the impact of hardware failures are developed and presented. The topic is introduced with a new fault-tolerant variant of Parareal. The chapter that follows presents a C++ library for multi-level checkpointing. The library uses lightweight in-memory checkpoints, protected through the use of erasure codes, to reduce the overhead of checkpointing and minimize the compute work lost to failures. Erasure codes have the unfortunate property that if more data blocks are lost than parity blocks were created, the data is effectively unrecoverable. The final chapter contains a preliminary study on partial information recovery from incomplete checksums. Under the assumption that some meta-knowledge exists about the structure of the encoded data, we show that the lost data may be recovered, at least partially. This result is of interest not only in HPC but also in data centers, where erasure codes are widely used to protect data efficiently.
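The erasure-code property described above is easiest to see in the simplest code, a single XOR parity block: one lost data block is recoverable, two are not. This Python sketch illustrates that recovery threshold only; the thesis's C++ library and its codes are not reproduced here:

```python
def make_parity(blocks: list[bytes]) -> bytes:
    """Single XOR parity block over equal-sized data blocks (RAID-5 style)."""
    parity = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            parity[i] ^= b
    return bytes(parity)

def recover_one(surviving: list[bytes], parity: bytes) -> bytes:
    """Rebuild the single missing block: XOR of the parity with all survivors.
    If two or more blocks are lost, this scheme cannot recover them; that is
    the regime the partial-recovery study targets."""
    return make_parity(surviving + [parity])
```

Multi-level checkpointing generalizes this by keeping several such protected copies at different storage tiers, trading recovery strength against checkpoint overhead.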
PROGRAM, THE NEBRASKA ACADEMY OF SCIENCES: One Hundred-Thirty-First Annual Meeting, APRIL 23-24, 2021. ONLINE
AFFILIATED SOCIETIES OF THE NEBRASKA ACADEMY OF SCIENCES, INC.
1. American Association of Physics Teachers, Nebraska Section: Web site: http://www.aapt.org/sections/officers.cfm?section=Nebraska
2. Friends of Loren Eiseley: Web site: http://www.eiseley.org/
3. Lincoln Gem & Mineral Club: Web site: http://www.lincolngemmineralclub.org/
4. Nebraska Chapter, National Council for Geographic Education
5. Nebraska Geological Society: Web site: http://www.nebraskageologicalsociety.org. Sponsors of a $50 award to the outstanding student paper presented at the Nebraska Academy of Sciences Annual Meeting, Earth Science / Nebraska Chapter, National Council sections.
6. Nebraska Graduate Women in Science
7. Nebraska Junior Academy of Sciences: Web site: http://www.nebraskajunioracademyofsciences.org/
8. Nebraska Ornithologists’ Union: Web site: http://www.noubirds.org/
9. Nebraska Psychological Association: Web site: http://www.nebpsych.org/
10. Nebraska-Southeast South Dakota Section, Mathematical Association of America: Web site: http://sections.maa.org/nesesd/
11. Nebraska Space Grant Consortium: Web site: http://www.ne.spacegrant.org/
CONTENTS
AERONAUTICS & SPACE SCIENCE
ANTHROPOLOGY
APPLIED SCIENCE & TECHNOLOGY
BIOLOGICAL & MEDICAL SCIENCES
COLLEGIATE ACADEMY: BIOLOGY
COLLEGIATE ACADEMY: CHEMISTRY & PHYSICS
EARTH SCIENCES
ENVIRONMENTAL SCIENCES
GENERAL CHEMISTRY
GENERAL PHYSICS
TEACHING OF SCIENCE & MATHEMATICS
2020-2021 PROGRAM COMMITTEE
2020-2021 EXECUTIVE COMMITTEE
FRIENDS OF THE ACADEMY
NEBRASKA ACADEMY OF SCIENCES FRIEND OF SCIENCE AWARD WINNERS
FRIEND OF SCIENCE AWARD TO DR PAUL KAR
2022 Review of Data-Driven Plasma Science
Data-driven science and technology offer transformative tools and methods to science. This review article highlights the latest developments and progress in the interdisciplinary field of data-driven plasma science (DDPS), i.e., plasma science whose progress is driven strongly by data and data analyses. Plasma is considered the most ubiquitous form of observable matter in the universe. Data associated with plasmas can therefore cover extremely large spatial and temporal scales, and often provide essential information for other scientific disciplines. Thanks to the latest technological developments, plasma experiments, observations, and computation now produce amounts of data that can no longer be analyzed or interpreted manually. This trend necessitates highly sophisticated use of high-performance computers for data analysis, making artificial intelligence and machine learning vital components of DDPS. The article contains seven primary sections in addition to the introduction and summary. Following an overview of fundamental data-driven science, five sections cover widely studied topics of plasma science and technology: basic plasma physics and laboratory experiments, magnetic confinement fusion, inertial confinement fusion and high-energy-density physics, space and astronomical plasmas, and plasma technologies for industrial and other applications. The final section before the summary discusses plasma-related databases that could significantly contribute to DDPS. Each primary section starts with a brief introduction to the topic, discusses state-of-the-art developments in the use of data and data-scientific approaches, and presents a summary and outlook. Despite recent impressive progress, DDPS is still in its infancy. This article attempts to offer a broad perspective on the development of the field and to identify where further innovations are required.