9 research outputs found

    Diversity TMR: Proof of Concept in a Mixed-Signal Case

    In this paper, a design-diversity fault-tolerance technique is applied to a mixed-signal (MS) system. Three different implementations of a second-order low-pass filter (all realizing the same transfer function), combined with a majority voter, are used to build the TMR scheme. The whole system is prototyped on a programmable mixed-signal device. Functional faults are injected into the circuit blocks, and measurements are made on the prototyped system. Results show that design-diversity TMR is a feasible technique that can increase the reliability of some classes of state-of-the-art MS circuits.
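The voting step can be illustrated with a minimal sketch. The abstract does not say how the voter is built, so the median selector, the tolerance check, and the sample values below are all assumptions chosen for illustration, not the paper's circuit. Taking the median of three analog samples is one common way to mask a single faulty replica.

```python
def median_vote(a: float, b: float, c: float) -> float:
    """Return the middle of three samples, which masks one outlier."""
    return sorted((a, b, c))[1]


def disagreeing_replicas(outputs, voted, tolerance):
    """List the indices of replicas whose output is farther than
    `tolerance` from the voted value (hypothetical diagnostic helper)."""
    return [i for i, y in enumerate(outputs) if abs(y - voted) > tolerance]


# Hypothetical samples from three diverse filter implementations;
# the third replica carries an injected fault.
outputs = (1.02, 0.98, 3.30)
voted = median_vote(*outputs)
suspects = disagreeing_replicas(outputs, voted, tolerance=0.1)
print(voted, suspects)
```

Because the three filters are implemented differently, their fault-free outputs agree only approximately, which is why the comparison uses a tolerance rather than exact equality.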

    Energy Efficient Configuration for QoS in Reliable Parallel Servers


    A Survey of Fault-Tolerance Techniques for Embedded Systems from the Perspective of Power, Energy, and Thermal Issues

    Relentless technology scaling has provided a significant increase in processor performance, but it has also had adverse impacts on system reliability. In particular, technology scaling increases processor susceptibility to radiation-induced transient faults. Moreover, technology scaling, combined with the discontinuation of Dennard scaling, increases power densities and thereby on-chip temperatures. High temperature, in turn, accelerates transistor aging mechanisms, which may ultimately lead to permanent faults on the chip. To assure reliable system operation despite these concerns, fault-tolerance techniques have emerged. Specifically, fault-tolerance techniques employ some kind of redundancy to satisfy specific reliability requirements. However, integrating fault-tolerance techniques into real-time embedded systems complicates preserving timing constraints. As a remedy, many task mapping/scheduling policies have been proposed that account for fault-tolerance techniques and enforce both timing and reliability guarantees for real-time embedded systems. More advanced techniques additionally aim at minimizing power and energy while still satisfying timing and reliability constraints. Recently, some scheduling techniques have started to tackle a new challenge: the temperature increase induced by employing fault-tolerance techniques. These emerging techniques aim at satisfying temperature constraints in addition to timing and reliability constraints. This paper provides an in-depth survey of the emerging research efforts that exploit fault-tolerance techniques while considering timing, power/energy, and temperature from the real-time embedded systems’ design perspective. In particular, the task mapping/scheduling policies for fault-tolerant real-time embedded systems are reviewed and classified according to their goals and constraints. Moreover, the employed fault-tolerance techniques, application models, and hardware models are considered as additional dimensions of the presented classification. Lastly, this survey gives insight into the main achievements and shortcomings of the existing approaches and highlights the most promising ones.

    Power-Aware Resilience for Exascale Computing

    To enable future scientific breakthroughs and discoveries, the next generation of scientific applications will require exascale computing performance to support the execution of predictive models and the analysis of massive quantities of data, with significantly higher resolution and fidelity than is possible within existing computing infrastructure. Delivering exascale performance will require massive parallelism, which could result in a computing system with over a million sockets, each supporting many cores, and hence a system with millions of components, including memory modules, communication networks, and storage devices. This increase in the number of components significantly increases the propensity of exascale computing systems to faults, while driving power consumption and operating costs to unforeseen heights. To achieve exascale performance, two challenges must be addressed: resilience to failures and adherence to power budget constraints. These two objectives conflict insofar as performance is concerned, as achieving high performance may push system components past their thermal limit and increase the likelihood of failure. With current systems, the dominant resilience technique is checkpoint/restart. It is believed, however, that this technique alone will not scale to the level necessary to support future systems. Therefore, alternative methods have been suggested to augment checkpoint/restart, for example process replication. In this thesis, we present a new fault tolerance model called shadow replication that addresses resilience and power simultaneously. Shadow replication associates a shadow process with each main process, similar to traditional replication; however, the shadow process executes at a reduced speed. Shadow replication reduces energy consumption and produces solutions faster than checkpoint/restart and other replication methods in limited power environments. Shadow replication reduces energy consumption by up to 25%, depending upon the application type, system parameters, and failure rates. The major contribution of this thesis is the development of shadow replication, a power-aware fault-tolerant computational model. The second contribution is an execution model applying shadow replication to future high-performance exascale-class systems. The third is a framework to analyze and simulate the power and energy consumption of fault tolerance methods in high-performance computing systems. Lastly, to prove the viability of shadow replication, an implementation is presented for the Message Passing Interface.

    Energy and Reliability Management in Parallel Real-Time Systems

    Historically, slack time in real-time systems has been used as temporal redundancy by rollback recovery schemes to increase system reliability in the presence of faults. However, with advanced technologies, slack time can also be used by energy management schemes to save energy. For reliable real-time systems where higher levels of reliability are as important as lower levels of energy consumption, centralized management of slack time is desired. For frame-based parallel real-time applications, energy management schemes are first explored. Although the simple static power management that evenly allocates static slack over a schedule is optimal for uni-processor systems, it is not optimal for parallel systems due to different levels of parallelism in a schedule. Taking parallelism variations into consideration, a parallel static power management scheme is proposed. When dynamic slack is considered, assuming global scheduling strategies, slack shifting and sharing schemes as well as speculation schemes are proposed for more energy savings. For simultaneous management of power and reliability, checkpointing techniques are first deployed to efficiently use slack time, and the optimal numbers of checkpoints needed to minimize energy consumption or to maximize system reliability are explored. Then, an energy-efficient optimistic modular redundancy scheme is addressed. Finally, a framework that encompasses energy and reliability management is proposed for obtaining optimal redundant configurations. While exploring the trade-off between energy and reliability, the effects of voltage scaling on fault rates are considered.
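The uni-processor claim about evenly allocated static slack can be illustrated with a small sketch. Everything in it is an assumption made for illustration rather than the thesis's model: power is taken as the cube of a normalized speed, static power and minimum-speed limits are ignored, and the workload and frame length are made-up numbers. Under those assumptions, running the whole frame at one stretched speed uses less energy than any uneven split of the same work.

```python
def energy(work: float, speed: float) -> float:
    """Energy for `work` units at normalized `speed` (0 < speed <= 1).

    Assumes power = speed**3, so energy = power * (work / speed)
    = work * speed**2.
    """
    return work * speed ** 2


def uniform_speed(work: float, frame: float) -> float:
    """Speed that spreads the static slack evenly over the frame."""
    if work > frame:
        raise ValueError("workload does not fit in the frame at full speed")
    return work / frame


# Hypothetical task: 6 time units of work at full speed, frame length 10.
work, frame = 6.0, 10.0

# Even allocation: one speed for the whole frame.
even = energy(work, uniform_speed(work, frame))

# Uneven allocation: first half of the work at full speed,
# second half stretched over the time that remains.
half = work / 2
uneven = energy(half, 1.0) + energy(half, half / (frame - half))

print(even, uneven)
```

The sketch covers only the single-processor case. As the abstract notes, the even allocation stops being optimal on parallel systems, where the level of parallelism varies across the schedule.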

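    Gate-Level Measures for Increasing the Reliability of Integrated Circuits with Respect to Gate-Oxide Defects (original title: Maßnahmen zur Steigerung der Zuverlässigkeit integrierter Schaltungen auf Gatterebene hinsichtlich Gateoxiddefekten)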

    Progressive scaling improves the dynamic parameters of an integrated circuit, but it also leads to wear-out effects that significantly limit the lifetime of these circuits through operation alone. This work presents new gate-level approaches for increasing the reliability of combinational integrated circuits with respect to gate-oxide defects; the approaches can be integrated into a standardized CMOS design flow. The work also covers the development of a simulator for analyzing the effects of gate-oxide defects.

    Energy-Efficient Duplex and TMR Real-Time Systems

    Duplex and Triple Modular Redundancy (TMR) systems are used when a high level of reliability is desired. Real-time systems for autonomous critical missions need such degrees of reliability, but energy consumption becomes a dominant concern when these systems are built out of high-performance processors that consume a large budget of electrical power for operation and cooling. Examples where energy consumption and real-time behavior are of paramount importance include reliable computers onboard mobile vehicles, such as the Mars Rover, satellites, and other autonomous vehicles. At first inspection, a duplex system uses about two thirds of the components that a TMR system does, leading one to conclude that duplex systems are more energy-efficient. This paper shows that this is not always the case. We present an analysis of the energy efficiency of duplex and TMR systems when used to tolerate transient failures. With no power management deployed, the analysis supports the intuitive impression that duplex systems are superior in energy consumption. The analysis shows, however, that the gap in energy consumption between the two types of systems diminishes with proper power management. We introduce the concept of an optimistic TMR system that offers the same reliability and performance as the traditional one, but at a fraction of the energy consumption budget. Optimistic TMR systems are competitive with power-aware duplex systems in energy consumption, can even outperform them in some situations, and have the added bonus of tolerating permanent faults.
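For background on why TMR is chosen when high reliability is needed, the sketch below evaluates the textbook reliability of a majority-voted triple. The abstract gives no reliability or energy model, so this is an assumption-laden illustration only: the three replicas are taken to fail independently, each with the same per-replica reliability `r`, and the voter is assumed perfect. It does not model the paper's optimistic TMR scheme.

```python
def tmr_reliability(r: float) -> float:
    """Probability that at least two of three independent replicas,
    each correct with probability `r`, produce the right result
    (perfect voter assumed)."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must be a probability")
    all_three = r ** 3
    exactly_two = 3 * r ** 2 * (1 - r)
    return all_three + exactly_two


# Hypothetical per-replica reliabilities.
for r in (0.5, 0.9, 0.99):
    print(r, tmr_reliability(r))
```

Under these assumptions the voted system is more reliable than a single replica only when `r` exceeds 0.5, and the gain grows as `r` approaches 1.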