
    Resiliency in numerical algorithm design for extreme scale simulations

    This work is based on the seminar titled ‘Resiliency in Numerical Algorithm Design for Extreme Scale Simulations’ held March 1–6, 2020, at Schloss Dagstuhl, which was attended by all the authors. Advanced supercomputing is characterized by very high computation speeds at the cost of an enormous amount of resources and energy. A typical large-scale computation running for 48 h on a system consuming 20 MW, as predicted for exascale systems, would consume a million kWh, corresponding to about 100k Euro in energy cost for executing 10²³ floating-point operations. It is clearly unacceptable to lose the whole computation if any of the several million parallel processes fails during the execution. Moreover, if a single operation suffers from a bit-flip error, should the whole computation be declared invalid? What about the notion of reproducibility itself: should this core paradigm of science be revised and refined for results that are obtained by large-scale simulation? Naive versions of conventional resilience techniques will not scale to the exascale regime: with a main memory footprint of tens of petabytes, synchronously writing checkpoint data all the way to background storage at frequent intervals will create intolerable overheads in runtime and energy consumption. Forecasts show that the mean time between failures could be lower than the time to recover from such a checkpoint, so that large calculations at scale might not make any progress if robust alternatives are not investigated. More advanced resilience techniques must be devised. The key may lie in exploiting both advanced system features as well as specific application knowledge. Research will face two essential questions: (1) what are the reliability requirements for a particular computation and (2) how do we best design the algorithms and software to meet these requirements? While the analysis of use cases can help understand the particular reliability requirements, the construction of remedies is currently wide open. One avenue would be to refine and improve on system- or application-level checkpointing and rollback strategies in the case an error is detected. Developers might use fault notification interfaces and flexible runtime systems to respond to node failures in an application-dependent fashion. Novel numerical algorithms or more stochastic computational approaches may be required to meet accuracy requirements in the face of undetectable soft errors. These ideas constituted an essential topic of the seminar. The goal of this Dagstuhl Seminar was to bring together a diverse group of scientists with expertise in exascale computing to discuss novel ways to make applications resilient against detected and undetected faults. In particular, participants explored the role that algorithms and applications play in the holistic approach needed to tackle this challenge. This article gathers a broad range of perspectives on the role of algorithms, applications and systems in achieving resilience for extreme scale simulations. The ultimate goal is to spark novel ideas and encourage the development of concrete solutions for achieving such resilience holistically. Peer reviewed. Article signed by 36 authors: Emmanuel Agullo, Mirco Altenbernd, Hartwig Anzt, Leonardo Bautista-Gomez, Tommaso Benacchio, Luca Bonaventura, Hans-Joachim Bungartz, Sanjay Chatterjee, Florina M. Ciorba, Nathan DeBardeleben, Daniel Drzisga, Sebastian Eibl, Christian Engelmann, Wilfried N. Gansterer, Luc Giraud, Dominik Göddeke, Marco Heisig, Fabienne Jézéquel, Nils Kohl, Xiaoye Sherry Li, Romain Lion, Miriam Mehl, Paul Mycek, Michael Obersteiner, Enrique S. Quintana-Ortí, Francesco Rizzi, Ulrich Rüde, Martin Schulz, Fred Fung, Robert Speck, Linda Stals, Keita Teranishi, Samuel Thibault, Dominik Thönnes, Andreas Wagner and Barbara Wohlmuth. Postprint (author's final draft).
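    As a rough sanity check of the figures quoted above, the short Python sketch below reproduces the energy and operation-count estimates and illustrates the classical checkpoint-interval trade-off via Young's approximation; the energy price, checkpoint cost and mean time between failures used here are illustrative assumptions, not values from the article.

```python
# Back-of-envelope check of the figures quoted above (illustrative only;
# the energy price is an assumption, not a value from the article).
from math import sqrt

power_mw = 20             # predicted exascale system power draw
runtime_h = 48            # length of the large-scale computation
price_per_kwh_eur = 0.10  # assumed energy price

energy_kwh = power_mw * 1000 * runtime_h          # 960,000 kWh, i.e. roughly a million kWh
energy_cost_eur = energy_kwh * price_per_kwh_eur  # roughly 100k Euro

flops_per_s = 1e18                                # exascale: 10^18 floating-point ops per second
total_flops = flops_per_s * runtime_h * 3600      # about 1.7e23, i.e. on the order of 10^23

print(f"Energy: {energy_kwh:,.0f} kWh, cost: {energy_cost_eur:,.0f} EUR, ops: {total_flops:.1e}")

# Young's first-order rule of thumb for the checkpoint interval:
# t_opt ~ sqrt(2 * checkpoint_cost * MTBF). If the MTBF approaches the time
# needed to write and recover a multi-petabyte checkpoint, t_opt collapses and
# the computation stops making progress -- the scenario the abstract warns about.
def optimal_checkpoint_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's estimate of the optimal time between checkpoints, in seconds."""
    return sqrt(2 * checkpoint_cost_s * mtbf_s)

print(optimal_checkpoint_interval(checkpoint_cost_s=600, mtbf_s=4 * 3600))  # ~4157 s, a bit over an hour
```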

    Real-Time Fault Detection and Diagnosis Using Intelligent Monitoring and Supervision Systems

    In monitoring and supervision schemes, fault detection and diagnosis characterize high-efficiency, high-quality production systems. To achieve such properties, these schemes rely on techniques that allow faults to be detected and diagnosed in real time: detection signals that a fault has occurred, while diagnosis provides its root cause and location. Fault detection is based on signal analysis and mathematical process models, while fault diagnosis draws on systems theory and process modeling. Monitoring and supervision complement each other in fault management, thus enabling normal and continuous operation. Their application avoids stopping productive processes through early detection of failures and real-time actions to eliminate them, such as predictive and proactive maintenance based on process conditions. The integration of all these methodologies yields intelligent monitoring and supervision systems capable of real-time fault detection and diagnosis. Their high performance is associated with statistical decision-making techniques, expert systems, artificial neural networks, fuzzy logic and computational procedures, making them efficient and fully autonomous in making decisions during the real-time operation of a production system.
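    As a minimal illustration of the model-based fault detection described above, the sketch below flags a fault when the residual between a measurement and a process-model prediction exceeds a threshold for several consecutive samples; the signals, threshold and persistence count are hypothetical, not taken from the article.

```python
# Minimal model-based fault detection sketch (hypothetical signals and threshold):
# declare a fault when the residual between the measurement and the process-model
# prediction exceeds a fixed limit for several consecutive samples.
from typing import Iterable, List

def detect_faults(measured: Iterable[float],
                  predicted: Iterable[float],
                  threshold: float = 3.0,
                  persistence: int = 3) -> List[int]:
    """Return the sample indices at which a fault is declared."""
    faults, run = [], 0
    for i, (y, y_hat) in enumerate(zip(measured, predicted)):
        residual = abs(y - y_hat)
        run = run + 1 if residual > threshold else 0
        if run >= persistence:          # require persistence to avoid noise-triggered alarms
            faults.append(i)
    return faults

# Example: a sensor that drifts away from the model prediction after sample 5.
model_output = [10.0] * 10
sensor_readings = [10.1, 9.8, 10.2, 10.0, 9.9, 14.5, 15.0, 15.2, 15.1, 15.3]
print(detect_faults(sensor_readings, model_output))  # -> [7, 8, 9]
```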

    Real-time Monitoring of Low Voltage Grids using Adaptive Smart Meter Data Collection

    Improved grid interaction of photovoltaics using smart micro-inverters

    Degradation modeling and degradation-aware control of power electronic systems

    The power electronics market was valued at $23.25 billion in 2019 and is projected to reach $36.64 billion by 2027. Power electronic systems (PES) have been extensively used in a wide range of critical applications, including automotive, renewable energy, industrial variable-frequency drives, etc. Thus, the PESs' reliability and robustness are immensely important for the smooth operation of mission-critical applications. Power semiconductor switches are among the most vulnerable components in a PES, and their vulnerability impacts the reliability and robustness of the whole system. Thus, switch-health monitoring and prognosis are critical for avoiding unexpected shutdowns and preventing catastrophic failures. The importance of the prognosis study increases dramatically with the growing popularity of the next-generation power semiconductor switches, wide-bandgap switches. These switches show immense promise in high-power, high-frequency operation due to their higher breakdown voltage and lower switching loss, but their wide adoption is limited by the inadequate reliability study. A thorough prognosis study comprising switch degradation modeling, remaining useful life (RUL) estimation and degradation-aware controller development is important to enhance the PESs' robustness, especially with wide-bandgap switches. In this dissertation, three studies are conducted to achieve these objectives: 1) Insulated Gate Bipolar Transistor (IGBT) degradation modeling and RUL estimation, 2) cascode Gallium Nitride (GaN) Field-Effect Transistor (FET) degradation modeling and RUL estimation, and 3) degradation-aware controller design for a PES, the solid-state transformer (SST). The first two studies address the significant variation in RUL estimation and propose degradation identification methods for the IGBT and the cascode GaN FET. In the third study, a system-level integration of the switch degradation model is implemented in the SST: the insight into the switch's degradation pattern gained from the first two studies is used to develop a degradation-aware controller for the SST. State-of-the-art controllers do not consider switch degradation, which results in premature system failure. The proposed low-complexity, degradation-aware and adaptive SST controller ensures optimal degradation-aware power transfer and robust operation over the lifetime.
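    As a hedged illustration of degradation-trend-based RUL estimation of the kind discussed above (not the dissertation's actual models), the sketch below fits a linear trend to a monitored degradation precursor, such as normalized on-state resistance, and extrapolates to an assumed failure threshold.

```python
# Illustrative RUL estimate from a degradation precursor (assumed data and
# threshold, not the dissertation's models): fit a linear trend to the
# precursor and extrapolate to the failure limit.
import numpy as np

def estimate_rul(cycles: np.ndarray, precursor: np.ndarray, failure_threshold: float) -> float:
    """Remaining useful life in cycles, from a linear fit of the precursor trend."""
    slope, intercept = np.polyfit(cycles, precursor, 1)
    if slope <= 0:
        return float("inf")                      # no measurable degradation trend yet
    cycles_to_failure = (failure_threshold - intercept) / slope
    return max(cycles_to_failure - cycles[-1], 0.0)

# Example: normalized on-state resistance drifting upward with power cycling.
cycles = np.array([0, 1000, 2000, 3000, 4000])
r_ds_on = np.array([1.00, 1.02, 1.05, 1.07, 1.10])   # relative to the pristine value
print(estimate_rul(cycles, r_ds_on, failure_threshold=1.20))  # ~4080 remaining cycles
```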

    Reliability Centered Maintenance (RCM) for Asset Management in Electric Power Distribution System

    The purpose of maintenance is to extend equipment lifetime, or at least the mean time to the next failure. Asset maintenance, which is part of asset management, incurs expenditure but can have very costly consequences if performed too little or not at all, while performing it too frequently may not be economical either. The decision to eliminate or minimize the risk of equipment failure must therefore not be based on trial and error, as it was in the past. In this thesis, an enhanced Reliability-Centered Maintenance (RCM) methodology, based on a quantitative relationship between preventive maintenance (PM) performed at the component level and overall system reliability, was applied to identify the distribution components that are critical to system reliability. A maintenance model relating probability of failure to maintenance activity was developed for maintainable distribution components. The Markov maintenance model developed was then used to predict the remaining life of transformer insulation for a selected distribution system. This model incorporates several levels of insulation deterioration and a minor maintenance state; if the current state of insulation ageing is estimated from diagnostic testing and inspection, the model can compute the average time before insulation failure occurs. The results obtained from both model simulation and a computer program implementing the mathematical formulation of the expected remaining life verified the mathematical analysis of the developed model. The conclusion from this study is that it is beneficial to base asset management decisions on a model that is verified with processed, analysed and tested outage data, such as the model developed in this thesis.
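    A minimal sketch of a Markov deterioration model of the kind described above: three insulation deterioration stages, a minor-maintenance state and an absorbing failure state, with the expected remaining life from each state obtained from the standard mean-time-to-absorption equations. The transition probabilities are illustrative assumptions, not values from the thesis.

```python
# Minimal discrete-time Markov deterioration model (illustrative transition
# probabilities, not values from the thesis). States: three insulation
# deterioration stages D1..D3, a minor-maintenance state M, and an absorbing
# failure state F. With one transition per year, the expected remaining life
# from each transient state is t = (I - Q)^-1 * 1, where Q is the transient block.
import numpy as np

states = ["D1", "D2", "D3", "M"]          # transient states; "F" is absorbing
Q = np.array([
    [0.85, 0.10, 0.00, 0.05],             # D1: mild deterioration
    [0.00, 0.75, 0.15, 0.10],             # D2: moderate deterioration
    [0.00, 0.00, 0.70, 0.10],             # D3: severe deterioration (remaining 0.20 -> F)
    [0.60, 0.30, 0.10, 0.00],             # M : minor maintenance partially restores the insulation
])

expected_life = np.linalg.solve(np.eye(4) - Q, np.ones(4))
for name, years in zip(states, expected_life):
    print(f"Expected remaining life from {name}: {years:.1f} years")
```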

    Revamping Timing Error Resilience to Tackle Choke Points at NTC

    The growing market of portable devices and smart wearables has contributed to innovation and development of systems with longer battery life. While Near Threshold Computing (NTC) systems address the need for longer battery life, they have certain limitations. NTC systems are prone to be significantly affected by variations in the fabrication process, commonly called process variation (PV). This dissertation explores an intriguing effect of PV, called choke points. Choke points are especially important due to their multifarious influence on the functional correctness of an NTC system. This work shows why novel research is required in this direction and proposes two techniques to resolve the problems created by choke points, while maintaining the reduced power needs.