
    Reliable High Performance Peta- and Exa-Scale Computing

    As supercomputers become larger and more powerful, they grow increasingly complex. This is reflected both in the exponentially increasing number of components in HPC systems (LLNL is currently installing the 1.6-million-core Sequoia system) and in the wide variety of software and hardware components that a typical system includes. At this scale it becomes infeasible to make each component reliable enough to prevent regular faults somewhere in the system, or to account for all possible cross-component interactions. The resulting faults and instability cause HPC applications to crash, perform sub-optimally, or even produce erroneous results. As supercomputers continue to approach Exascale performance and full system reliability becomes prohibitively expensive, we will require novel techniques to bridge the gap between the lower reliability provided by hardware systems and users' unchanging need for consistent performance and reliable results. Previous research on HPC system reliability has developed techniques for tolerating and detecting various types of faults. However, these techniques have seen very limited real-world applicability because of our poor understanding of how real systems are affected by complex faults such as soft-fault-induced bit flips or performance degradations. Prior work has generally focused on analyzing the behavior of entire software/hardware systems, both during normal operation and in the face of faults. Because such behaviors are extremely complex, these studies have produced only coarse behavioral models of limited sets of software/hardware stacks. Since this provides little insight into the many different system stacks and applications used in practice, this work has had little real-world impact.
My project addresses this problem by developing a modular methodology to analyze the behavior of applications and systems during both normal and faulty operation. By synthesizing models of individual components into whole-system behavior models, my work is making it possible to automatically understand the behavior of arbitrary real-world systems and enable them to tolerate a wide range of system faults. My project follows a multi-pronged research strategy. Section II discusses my work on modeling the behavior of existing applications and systems: Section II.A discusses resilience in the face of soft faults, and Section II.B looks at techniques to tolerate performance faults. Finally, Section III presents an alternative approach that studies how a system should be designed from the ground up to make resilience natural and easy.

    Soft Error Vulnerability of Iterative Linear Algebra Methods

    Devices are increasingly vulnerable to soft errors as their feature sizes shrink. Previously, soft error rates were significant primarily in space and high-atmospheric computing. Modern architectures now use such small features at such low voltages that soft errors are becoming important even at terrestrial altitudes. Due to their large number of components, supercomputers are particularly susceptible to soft errors. Since many large-scale parallel scientific applications use iterative linear algebra methods, the soft error vulnerability of these methods constitutes a large fraction of the applications' overall vulnerability. Many users consider these methods invulnerable to most soft errors because they converge from an imprecise solution to a precise one. However, we show in this paper that iterative methods are vulnerable to soft errors, exhibiting both silent data corruptions and a poor ability to detect errors. Further, we evaluate a variety of soft error detection and tolerance techniques, including checkpointing, linear matrix encodings, and residual tracking.
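One of the detection techniques the abstract names, linear matrix encodings, can be illustrated with a Huang-Abraham-style checksum: an extra checksum row appended to the matrix lets a single corrupted entry of a matrix-vector product be detected without recomputing it. The sketch below is illustrative only; the function names and tolerance are our own choices, not the paper's implementation.

```python
# Sketch of a checksum (ABFT) check for a matrix-vector product y = A x.
# The column sums of A form one extra "checksum" row; in exact arithmetic,
# sum(y) must equal (checksum row) . x, so a mismatch flags corruption.

def matvec(A, x):
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

def encode_checksum_row(A):
    # Column sums of A form the checksum row.
    return [sum(col) for col in zip(*A)]

def check_matvec(A, x, y, tol=1e-9):
    c = encode_checksum_row(A)
    expected = sum(c_j * x_j for c_j, x_j in zip(c, x))
    return abs(sum(y) - expected) <= tol

A = [[2.0, 1.0], [0.0, 3.0]]
x = [1.0, 4.0]
y = matvec(A, x)            # [6.0, 12.0]
assert check_matvec(A, x, y)

y[1] += 0.5                 # simulate a soft-error corruption of y
assert not check_matvec(A, x, y)
```

The check costs one extra dot product per matrix-vector multiply, which is why such encodings are attractive for large sparse solvers.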

    Soft Error Vulnerability of Iterative Linear Algebra Methods

    Devices become increasingly vulnerable to soft errors as their feature sizes shrink. Previously, soft errors primarily caused problems for space and high-atmospheric computing applications. Modern architectures now use such small features at such low voltages that soft errors are becoming significant even at terrestrial altitudes. The soft error vulnerability of iterative linear algebra methods, which many scientific applications use, is a critical aspect of overall application vulnerability. These methods are often considered invulnerable to many soft errors because they converge from an imprecise solution to a precise one. However, we show that iterative methods can be vulnerable to soft errors, with a high rate of silent data corruptions. We quantify this vulnerability, with algorithms generating up to 8.5% erroneous results when subjected to a single bit flip. Further, we show that detecting soft errors in an iterative method depends on its detailed convergence properties and requires more complex mechanisms than simply checking the residual. Finally, we explore inexpensive techniques to tolerate soft errors in these methods.
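The claim that checking the residual is not enough can be demonstrated with a small fault-injection experiment: if a bit flip lands in the stored matrix, the iteration happily converges on the corrupted system, so the residual a solver would compute (against its in-memory data) looks fine even though the answer is wrong. This sketch uses a plain Jacobi iteration and our own choice of flipped bit; it is not the paper's experimental setup.

```python
# Inject a single bit flip into the matrix during a Jacobi solve and
# observe a silent data corruption: the run still converges, but to
# the solution of the corrupted system.

import struct

def flip_bit(x, bit):
    # Flip one bit of an IEEE-754 double via its 64-bit representation.
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    return struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))[0]

def jacobi(A, b, iters=300, flip_at=None):
    # Plain Jacobi iteration; optionally corrupt A[0][1] at step flip_at.
    x = [0.0] * len(b)
    for k in range(iters):
        if k == flip_at:
            A[0][1] = flip_bit(A[0][1], 60)  # exponent bit: 1.0 -> ~2**-256
        x = [(b[i] - sum(A[i][j] * x[j] for j in range(len(b)) if j != i))
             / A[i][i] for i in range(len(b))]
    return x

def residual(A, x, b):
    return max(abs(sum(A[i][j] * x[j] for j in range(len(x))) - b[i])
               for i in range(len(b)))

A_good = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x_good = jacobi([r[:] for r in A_good], b)

A_bad = [r[:] for r in A_good]
x_bad = jacobi(A_bad, b, flip_at=5)   # A_bad is mutated by the flip

# Residual against the in-memory (corrupted) matrix is tiny, so a naive
# residual check passes -- yet the answer is wrong: a silent corruption.
assert residual(A_bad, x_bad, b) < 1e-9
assert residual(A_good, x_bad, b) > 1e-2
```

Flips that hit the iterate itself are often absorbed by reconvergence, which is exactly why the vulnerability depends on where the fault strikes and on the method's convergence behavior.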

    CLOMP: Accurately Characterizing OpenMP Application Overheads

    Despite its ease of use, OpenMP has failed to gain widespread use on large-scale systems, largely because it has not delivered sufficient performance. Our experience indicates that the cost of initiating OpenMP regions is simply too high for the desired OpenMP usage scenario of many applications. In this paper, we introduce CLOMP, a new benchmark that accurately characterizes this aspect of OpenMP implementations. CLOMP complements the existing EPCC benchmark suite by providing simple, easy-to-understand measurements of OpenMP overheads in the context of application usage scenarios. Our results for several OpenMP implementations demonstrate that CLOMP identifies the amount of work required to compensate for the overheads observed with EPCC. Further, we show that CLOMP also captures the limitations of OpenMP parallelization on NUMA systems.
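The "amount of work required to compensate for the overheads" has a simple back-of-envelope form: if starting a parallel region costs a fixed overhead and the region splits W seconds of serial work across p threads, the parallel version wins only when W exceeds overhead * p / (p - 1). The arithmetic below is our own illustration of this break-even point, not CLOMP's actual methodology, and the 5-microsecond overhead is an assumed placeholder.

```python
# Back-of-envelope model of OpenMP region overhead, in the spirit of
# what CLOMP measures. A region costing `overhead` seconds to start,
# splitting `work` seconds of serial work across `threads` threads,
# runs in roughly overhead + work/threads seconds.

def parallel_time(work, threads, overhead):
    return overhead + work / threads

def break_even_work(threads, overhead):
    # Smallest serial work that amortizes the region-start overhead:
    # overhead + W/p < W  =>  W > overhead * p / (p - 1)
    return overhead * threads / (threads - 1)

overhead = 5e-6          # assumed: 5 us to fork/join a region
threads = 8
w = break_even_work(threads, overhead)

# Below the break-even point, the "parallel" version is slower:
assert parallel_time(w / 2, threads, overhead) > w / 2
# Well above it, it wins:
assert parallel_time(100 * w, threads, overhead) < 100 * w
```

This is why region-start cost dominates for applications whose parallel regions are short: the work per region never reaches the break-even point.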

    Detailed Modeling, Design, and Evaluation of a Scalable Multi-level Checkpointing System

    High-performance computing (HPC) systems are growing more powerful by utilizing more hardware components. As the system mean time before failure correspondingly drops, applications must checkpoint more frequently to make progress. However, because system memory sizes are growing faster than the bandwidth to the parallel file system, the cost of checkpointing is beginning to dominate application run times. A potential solution to this problem is multi-level checkpointing, which employs multiple types of checkpoints with different costs and different levels of resiliency in a single run. The goal is to use light-weight checkpoints to handle the most common failure modes and to rely on more expensive checkpoints for less common but more severe failures. While this approach is theoretically promising, it has not been fully evaluated in a large-scale, production-system context. To this end, we have designed the Scalable Checkpoint/Restart (SCR) library, which writes checkpoints to storage on the compute nodes, utilizing RAM, Flash, or disk, in addition to the parallel file system. We present the performance and reliability properties of SCR as well as a probabilistic Markov model that predicts its performance on current and future systems. We show that multi-level checkpointing improves efficiency on existing large-scale systems and that this benefit increases as system size grows. In particular, we developed low-cost checkpoint schemes that are 100x-1000x faster than the parallel file system and effective against 85% of our system failures. This leads to a gain in machine efficiency of up to 35%, and it reduces the load on the parallel file system by a factor of two on current and future systems.
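The intuition behind multi-level checkpointing can be sketched with Young's classic approximation (optimal checkpoint interval ~ sqrt(2 * checkpoint_cost * MTBF)) applied per level. This is a much-simplified estimate, not the Markov model from the paper, and every number below is a made-up placeholder; it only shows why cheap node-local checkpoints covering 85% of failures cut the total checkpoint overhead.

```python
# Simplified two-level checkpoint overhead estimate using Young's
# approximation per level. NOT the paper's Markov model; all
# parameters are assumed placeholders.

import math

def young_interval(ckpt_cost, mtbf):
    # Young's approximation for the optimal checkpoint interval.
    return math.sqrt(2.0 * ckpt_cost * mtbf)

def overhead_fraction(ckpt_cost, mtbf):
    # First-order fraction of time spent writing checkpoints at the
    # optimal interval (ignores rework and restart time).
    return ckpt_cost / young_interval(ckpt_cost, mtbf)

mtbf = 24 * 3600.0        # assumed: one failure per day
pfs_cost = 600.0          # assumed: 10 min per parallel-file-system checkpoint
local_cost = 6.0          # assumed: node-local checkpoint ~100x faster

# Single level: every checkpoint goes to the parallel file system.
single = overhead_fraction(pfs_cost, mtbf)

# Two levels: node-local checkpoints cover the 85% of failures that are
# recoverable locally; PFS checkpoints cover the remaining 15%, which
# arrive with a correspondingly longer effective MTBF.
two_level = (overhead_fraction(local_cost, mtbf / 0.85)
             + overhead_fraction(pfs_cost, mtbf / 0.15))

assert two_level < single   # cheap local checkpoints cut total overhead
```

Because the rare severe failures see a longer effective MTBF, the expensive PFS checkpoints can be taken far less often, which is also where the reduced parallel-file-system load comes from.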

    Mammalian microRNA: an important modulator of host-pathogen interactions in human viral infections

    MicroRNAs (miRNAs), which are small non-coding RNAs expressed by almost all metazoans, have key roles in the regulation of cell differentiation, organism development and gene expression. Thousands of miRNAs regulating approximately 60% of the total human genome have been identified. They regulate gene expression either by direct cleavage or by translational repression of the target mRNAs recognized through partial complementary base pairing. The active and functional unit of miRNA is its complex with Argonaute proteins, known as the microRNA-induced silencing complex (miRISC). De-regulated miRNA expression in the human cell may contribute to a diverse group of disorders including cancer, cardiovascular dysfunction, liver damage, immunological dysfunction, metabolic syndromes and pathogenic infections. Recent studies have revealed that miRNAs are indeed a pivotal component of host-pathogen interactions and host immune responses toward microorganisms. miRNA is emerging as a tool for genetic study, therapeutic development and diagnosis of human pathogenic infections caused by viruses, bacteria, parasites and fungi. Many pathogens can exploit the host miRNA system for their own benefit, such as surviving inside the host cell, replication, pathogenesis and bypassing some host immune barriers, while some express pathogen-encoded miRNAs inside the host that contribute to their replication, survival and/or latency. In this review, we discuss the role and significance of miRNA in relation to some pathogenic viruses.

    Glia-to-neuron transfer of miRNAs via extracellular vesicles: a new mechanism underlying inflammation-induced synaptic alterations

    Recent evidence indicates synaptic dysfunction as an early mechanism affected in neuroinflammatory diseases, such as multiple sclerosis, which are characterized by chronic microglia activation. However, the mode(s) of action of reactive microglia in causing synaptic defects are not fully understood. In this study, we show that inflammatory microglia produce extracellular vesicles (EVs) which are enriched in a set of miRNAs that regulate the expression of key synaptic proteins. Among them, miR-146a-5p, a microglia-specific miRNA not present in hippocampal neurons, controls the expression of presynaptic synaptotagmin1 (Syt1) and postsynaptic neuroligin1 (Nlg1), an adhesion protein which plays a crucial role in dendritic spine formation and synaptic stability. Using a Renilla-based sensor, we provide formal proof that inflammatory EVs transfer their miR-146a-5p cargo to neurons. By western blot and immunofluorescence analysis, we show that vesicular miR-146a-5p suppresses Syt1 and Nlg1 expression in receiving neurons. Microglia-to-neuron miR-146a-5p transfer and Syt1 and Nlg1 downregulation do not occur when EV–neuron contact is inhibited by cloaking vesicular phosphatidylserine residues, nor when neurons are exposed to EVs either depleted of miR-146a-5p (produced by pro-regenerative microglia) or storing inactive miR-146a-5p (produced by cells transfected with an anti-miR-146a-5p). Morphological analysis reveals that prolonged exposure to inflammatory EVs leads to a significant decrease in dendritic spine density in hippocampal neurons in vivo and in primary culture, which is rescued in vitro by transfection of a miR-insensitive Nlg1 form. Dendritic spine loss is accompanied by a decrease in the density and strength of excitatory synapses, as indicated by reduced mEPSC frequency and amplitude.
These findings link inflammatory microglia and enhanced EV production to the loss of excitatory synapses, uncovering a previously unrecognized role for microglia-enriched miRNAs, released in association with EVs, in the silencing of key synaptic genes.