32 research outputs found

    Doctor of Philosophy

    Get PDF
    Current scaling trends in transistor technology, in pursuit of larger component counts and improved power efficiency, are making hardware increasingly less reliable. Because of extreme transistor miniaturization, it is becoming easier to flip a bit stored in memory elements built from these transistors. Soft errors, transient bit-flips caused by alpha particles and cosmic rays striking those memory elements, have therefore become one of the major impediments to system resilience as we move towards exascale computing. Soft errors that escape the hardware layer may silently corrupt a program's runtime data, causing silent data corruption in the output. Because soft errors are transient in nature, it is also notoriously hard to trace their origins. Techniques for enhancing system resilience therefore hinge on the availability of efficient error detectors with high detection rates, low false-positive rates, and low computational overhead. It is equally important to have a flexible infrastructure capable of simulating realistic soft error models so that newly developed error detectors can be evaluated effectively. In this work, we present a set of techniques for efficiently detecting soft errors affecting control flow, data, and structured address computations in an application. We evaluate the efficacy of the proposed techniques on a collection of benchmarks through fault-injection-driven studies. As an important prerequisite, we also introduce two new LLVM-based fault injectors, KULFI and VULFI, geared towards scalar and vector architectures, respectively. Through this work, we aim to contribute to the system resilience community by making our research tools (error detectors and fault injectors) publicly available.
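    To make the fault model concrete: injectors such as KULFI and VULFI operate at the LLVM level, but the core idea of a single-event bit-flip can be illustrated with a much simpler sketch. The snippet below is a toy, assumption-laden illustration in Python (the function names and injection point are invented for this example and are not the tools' actual API); it flips one random bit in an intermediate value and compares the result against a fault-free "golden" run to detect silent data corruption.

    ```python
    # Toy single-bit-flip fault injection: a simplified illustration of the
    # soft-error model, NOT the KULFI/VULFI implementation or interface.
    import random
    import struct

    def flip_random_bit(value: float) -> float:
        """Flip one randomly chosen bit in the IEEE-754 encoding of value."""
        bits = struct.unpack("<Q", struct.pack("<d", value))[0]   # float -> 64-bit pattern
        bit = random.randrange(64)                                # random bit position
        return struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))[0]

    def dot(xs, ys, fault_at=None):
        """Dot product with an optional transient fault injected at one iteration."""
        acc = 0.0
        for i, (x, y) in enumerate(zip(xs, ys)):
            acc += x * y
            if i == fault_at:                 # simulate a soft error in the accumulator
                acc = flip_random_bit(acc)
        return acc

    if __name__ == "__main__":
        xs, ys = [1.0] * 1000, [2.0] * 1000
        golden = dot(xs, ys)                  # fault-free reference run
        faulty = dot(xs, ys, fault_at=500)    # run with one injected bit-flip
        print("golden:", golden, "faulty:", faulty, "SDC:", faulty != golden)
    ```

    Depending on which bit is flipped, the corrupted run may produce a wildly wrong value, a NaN, or (if the fault is masked by later arithmetic) no visible difference at all, which is exactly why detection-rate and false-positive studies require many injection trials.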

    Laboratory Directed Research and Development FY-10 Annual Report

    Full text link
    The FY 2010 Laboratory Directed Research and Development (LDRD) Annual Report is a compendium of the diverse research performed to develop and ensure that INL's technical capabilities can support future DOE missions and national research priorities. LDRD is essential to INL: it provides a means for the laboratory to pursue novel scientific and engineering research in areas deemed too basic or risky for programmatic investments. This research enhances technical capabilities at the laboratory, providing scientific and engineering staff with opportunities for skill building and partnership development.

    Scaling and Resilience in Numerical Algorithms for Exascale Computing

    Get PDF
    The first Petascale supercomputer, the IBM Roadrunner, went online in 2008. Ten years later, the community is looking ahead to a new generation of Exascale machines. During the intervening decade, several hundred Petascale-capable machines have been installed worldwide, yet despite the abundance of machines, applications that scale to their full size remain rare. Large clusters now routinely have 50,000+ cores, and some have several million. This extreme level of parallelism, which enables a theoretical compute capacity in excess of a million billion operations per second, turns out to be difficult to exploit in many applications of practical interest: processors often spend more time waiting for synchronization, communication, and other coordinating operations to complete than actually computing. Component reliability is another challenge facing HPC developers. If even a single processor among many thousands fails, the user is forced to restart traditional applications, wasting valuable compute time. These issues collectively manifest themselves as low parallel efficiency, resulting in wasted energy and computational resources. Future performance improvements are expected to continue to come largely from increased parallelism, so one may speculate that the difficulties currently faced when scaling applications to Petascale machines will progressively worsen, making it difficult for scientists to harness the full potential of Exascale computing.
    The thesis comprises two parts, each consisting of several chapters that discuss modifications of numerical algorithms to make them better suited for future Exascale machines. The first part considers Parareal, a Parallel-in-Time integration technique for the scalable numerical solution of partial differential equations. We propose a new adaptive scheduler that optimizes parallel efficiency by minimizing the time-subdomain length without making communication of time-subdomains too costly. In conjunction with an appropriate preconditioner, we demonstrate that it is possible to obtain time-parallel speedup on the nonlinear shallow water equations beyond what is possible using conventional spatial domain-decomposition techniques alone. The part concludes with the proposal of a new method for constructing Parallel-in-Time integration schemes better suited to convection-dominated problems.
    The second part develops and presents new ways of mitigating the impact of hardware failures. The topic is introduced with a new fault-tolerant variant of Parareal. The chapter that follows presents a C++ library for multi-level checkpointing. The library uses lightweight in-memory checkpoints, protected through the use of erasure codes, to mitigate the impact of failures by decreasing the overhead of checkpointing and minimizing the compute work lost. Erasure codes have the unfortunate property that if more data blocks are lost than parity codes were created, the data is effectively unrecoverable. The final chapter therefore contains a preliminary study on partial information recovery from incomplete checksums. Under the assumption that some meta-knowledge exists about the structure of the encoded data, we show that the lost data may be recovered, at least partially. This result is of interest not only in HPC but also in data centers, where erasure codes are widely used to protect data efficiently.
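    For context, Parallel-in-Time methods of this kind build on the textbook Parareal iteration (shown here in its standard form; the adaptive scheduler and preconditioning described above are refinements of this idea, and the exact variant used in the thesis may differ). With a cheap coarse propagator G and an accurate fine propagator F, the solution U_n^k at time t_n and iteration k is updated as

        U_{n+1}^{k+1} = \mathcal{G}\bigl(U_{n}^{k+1}\bigr) + \mathcal{F}\bigl(U_{n}^{k}\bigr) - \mathcal{G}\bigl(U_{n}^{k}\bigr)

    The fine propagations \mathcal{F}(U_{n}^{k}) are independent across time subdomains and can run in parallel, while only the coarse corrections \mathcal{G} are applied serially; this is what makes speedup in the time direction possible, and why scheduling the time subdomains well matters for parallel efficiency.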

    PROGRAM, THE NEBRASKA ACADEMY OF SCIENCES: One Hundred-Thirty-First Annual Meeting, APRIL 23-24, 2021. ONLINE

    Get PDF
    AFFILIATED SOCIETIES OF THE NEBRASKA ACADEMY OF SCIENCES, INC.
    1. American Association of Physics Teachers, Nebraska Section: Web site: http://www.aapt.org/sections/officers.cfm?section=Nebraska
    2. Friends of Loren Eiseley: Web site: http://www.eiseley.org/
    3. Lincoln Gem & Mineral Club: Web site: http://www.lincolngemmineralclub.org/
    4. Nebraska Chapter, National Council for Geographic Education
    5. Nebraska Geological Society: Web site: http://www.nebraskageologicalsociety.org (sponsors, with the Nebraska Chapter of the National Council, a $50 award to the outstanding student paper presented in the Earth Sciences section at the Nebraska Academy of Sciences Annual Meeting)
    6. Nebraska Graduate Women in Science
    7. Nebraska Junior Academy of Sciences: Web site: http://www.nebraskajunioracademyofsciences.org/
    8. Nebraska Ornithologists’ Union: Web site: http://www.noubirds.org/
    9. Nebraska Psychological Association: Web site: http://www.nebpsych.org/
    10. Nebraska-Southeast South Dakota Section, Mathematical Association of America: Web site: http://sections.maa.org/nesesd/
    11. Nebraska Space Grant Consortium: Web site: http://www.ne.spacegrant.org/
    CONTENTS: AERONAUTICS & SPACE SCIENCE; ANTHROPOLOGY; APPLIED SCIENCE & TECHNOLOGY; BIOLOGICAL & MEDICAL SCIENCES; COLLEGIATE ACADEMY: BIOLOGY; COLLEGIATE ACADEMY: CHEMISTRY & PHYSICS; EARTH SCIENCES; ENVIRONMENTAL SCIENCES; GENERAL CHEMISTRY; GENERAL PHYSICS; TEACHING OF SCIENCE & MATHEMATICS; 2020-2021 PROGRAM COMMITTEE; 2020-2021 EXECUTIVE COMMITTEE; FRIENDS OF THE ACADEMY; NEBRASKA ACADEMY OF SCIENCES FRIEND OF SCIENCE AWARD WINNERS; FRIEND OF SCIENCE AWARD TO DR PAUL KAR

    Understanding Soft Error Resiliency of Blue Gene/Q Compute Chip through Hardware Proton Irradiation and Software Fault Injection

    No full text

    2022 Review of Data-Driven Plasma Science

    Get PDF
    Data-driven science and technology offer transformative tools and methods to science. This review article highlights the latest developments and progress in the interdisciplinary field of data-driven plasma science (DDPS), i.e., plasma science whose progress is driven strongly by data and data analyses. Plasma is considered to be the most ubiquitous form of observable matter in the universe. Data associated with plasmas can, therefore, cover extremely large spatial and temporal scales, and often provide essential information for other scientific disciplines. Thanks to the latest technological developments, plasma experiments, observations, and computation now produce a volume of data that can no longer be analyzed or interpreted manually. This trend necessitates a highly sophisticated use of high-performance computers for data analyses, making artificial intelligence and machine learning vital components of DDPS. This article contains seven primary sections, in addition to the introduction and summary. Following an overview of fundamental data-driven science, five other sections cover widely studied topics of plasma science and technologies: basic plasma physics and laboratory experiments, magnetic confinement fusion, inertial confinement fusion and high-energy-density physics, space and astronomical plasmas, and plasma technologies for industrial and other applications. The final section before the summary discusses plasma-related databases that could significantly contribute to DDPS. Each primary section starts with a brief introduction to the topic, discusses the state-of-the-art developments in the use of data and/or data-scientific approaches, and presents a summary and outlook. Despite recent impressive progress, DDPS is still in its infancy. This article attempts to offer a broad perspective on the development of this field and to identify where further innovations are required.