
    Runtime Management of Multiprocessor Systems for Fault Tolerance, Energy Efficiency and Load Balancing

    The efficiency of modern multiprocessor systems is hurt by unpredictable events: aging causes permanent faults that disable components; applications spawning and terminating at arbitrary times disturb energy proportionality, wasting energy; and load imbalances reduce resource utilization, penalizing performance. This thesis demonstrates how runtime management can mitigate the negative effects of such events by making decisions guided by a combination of static information known in advance and parameters that only become known at runtime. Techniques are proposed for three objectives: graceful degradation of aging-prone systems, energy efficiency of heterogeneous adaptive systems, and load balancing by means of work stealing.

    Managing aging-prone systems for graceful efficiency degradation is based on a high-level system description that encapsulates hardware reconfigurability and workload flexibility, making it possible to quantify system efficiency and use it as an objective function. Several custom heuristics, as well as simulated annealing and a genetic algorithm, are proposed to optimize this objective function in response to component failures. The custom heuristics are one to two orders of magnitude faster, provide better efficiency for the first 20% of system lifetime, and are less than 13% worse than the genetic algorithm at the end of that lifetime. The custom heuristics occasionally fail to satisfy reconfiguration cost constraints; since the execution time of all algorithms scales well with system size, the genetic algorithm can serve as a backup in those cases.

    Managing heterogeneous multiprocessors capable of Dynamic Voltage and Frequency Scaling is based on a model that accurately predicts performance and power: performance is predicted by combining static, application-specific profiling information with dynamic, runtime performance-monitoring data; power is predicted from these performance estimates and a set of platform-specific static parameters, determined only once and reused for every application mix. Three runtime heuristics are proposed that use this model to perform a partial search of the configuration space, evaluating a small set of configurations and selecting the best one. When best-effort performance is adequate, the proposed approach achieves 3% higher energy efficiency than the powersave governor and 2x better than the interactive and ondemand governors. When individual applications' performance requirements are considered, the proposed approach satisfies them while giving up 18% of the system's energy efficiency relative to powersave, which in turn misses the performance targets by 23%; at the same time, it maintains an efficiency advantage of about 55% over the other governors, which also satisfy the requirements.

    Lastly, to improve load balancing of multiprocessors, a partial and approximate view of the current load distribution among system cores is proposed; it consists of lightweight data structures and is maintained by each core through cheap operations. A runtime algorithm uses this view whenever a core becomes idle to select a victim core for work stealing, also taking system topology and the memory hierarchy into account. Across 12 diverse imbalanced workloads, the proposed approach achieves better performance than random, hierarchical, and local stealing on six workloads; on the other six it is at most 8% slower, while each competing strategy incurs a penalty of at least 89% on some workload.
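    As an illustration of the load-balancing idea, the Python sketch below shows how an idle core might pick a stealing victim from a cheap, approximate per-core load view while preferring topologically close cores. The names (LoadView, select_victim, node_of) and the policy details are illustrative assumptions, not the thesis's actual data structures.

    from dataclasses import dataclass, field

    @dataclass
    class LoadView:
        """Approximate view of other cores' queue lengths, updated cheaply."""
        n_cores: int
        est_load: list = field(default_factory=list)

        def __post_init__(self):
            if not self.est_load:
                self.est_load = [0] * self.n_cores  # stale-tolerant estimates

        def record(self, core: int, queue_len: int) -> None:
            # O(1) update, e.g. piggybacked on scheduling events; may be stale.
            self.est_load[core] = queue_len

    def select_victim(view: LoadView, me: int, node_of) -> int | None:
        """Pick the most loaded core, preferring the caller's NUMA node."""
        candidates = [c for c in range(view.n_cores)
                      if c != me and view.est_load[c] > 0]
        if not candidates:
            return None  # nothing worth stealing, according to the view
        # Same-node cores sort first (False < True), then by descending load.
        return min(candidates,
                   key=lambda c: (node_of(c) != node_of(me), -view.est_load[c]))

    For example, select_victim(view, me=0, node_of=lambda c: c // 4) would favor loaded cores 1-3 before cores on other quad-core nodes.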

    A Holistic Solution for Reliability of 3D Parallel Systems

    As device scaling slows down, emerging technologies such as 3D integration and carbon nanotube field-effect transistors are among the most promising solutions for increasing device density and performance. These emerging technologies offer shorter interconnects, higher performance, and lower power. However, higher operating temperatures and current densities project significantly higher failure rates. Moreover, owing to the infancy of the manufacturing process and its high variation and defect densities, chip designers are not encouraged to consider these emerging technologies as a stand-alone replacement for silicon-based transistors. The goal of this dissertation is to introduce new architectural and circuit techniques that can work around the high fault rates of the emerging 3D technologies, achieving performance and reliability comparable to silicon. We propose a new holistic approach to the reliability problem that addresses the necessary aspects of an effective solution, namely detection, diagnosis, repair, and prevention, synergistically in a practical solution. By leveraging 3D fabric layouts, it provides the underlying architecture to efficiently repair the system in the presence of faults. The thesis presents a fault detection scheme that re-executes instructions on idle identical units, distinguishing between transient and permanent faults while localizing the fault to the granularity of a pipeline stage. Furthermore, using a dynamic and adaptive reconfiguration policy based on activity factors and temperature variation, we propose a framework that delivers a significant improvement in lifetime management, preventing faults due to aging. Finally, a design framework is presented that can be used for large-scale chip production while mitigating yield and variation failures to bring up carbon-nanotube-based technology. The proposed framework efficiently supports high-variation technologies by providing protection against manufacturing defects at different granularities: module and pipeline-stage levels.
    PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
    http://deepblue.lib.umich.edu/bitstream/2027.42/168118/1/javadb_1.pd
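    To make the detection idea concrete, the following Python sketch re-executes an instruction on an idle identical unit and compares results; a disagreement that vanishes on retry is classified as transient, a persistent one as permanent. The function names and retry policy are illustrative assumptions, not the dissertation's microarchitecture.

    def classify(execute, unit_a, unit_b, instr, retries: int = 2) -> str:
        """execute(unit, instr) -> result; unit_b is an idle identical unit."""
        if execute(unit_a, instr) == execute(unit_b, instr):
            return "fault-free"
        # Results disagree: re-execute. A transient upset should not recur.
        for _ in range(retries):
            if execute(unit_a, instr) == execute(unit_b, instr):
                return "transient"
        # The mismatch persists, indicating a permanent fault in one of the
        # two units; a third identical unit could then vote to localize the
        # faulty pipeline stage.
        return "permanent"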

    Autonomous Recovery Of Reconfigurable Logic Devices Using Priority Escalation Of Slack

    Field Programmable Gate Array (FPGA) devices offer a suitable platform for survivable hardware architectures in mission-critical systems. In this dissertation, active dynamic redundancy-based fault-handling techniques are proposed which exploit the dynamic partial reconfiguration capability of SRAM-based FPGAs. Self-adaptation is realized by employing reconfiguration in the detection, diagnosis, and recovery phases. To extend these concepts to semiconductor aging and process variation in the deep-submicron era, resilient adaptable processing systems are sought that maintain quality and throughput requirements despite the vulnerabilities of the underlying computational devices. A new approach to autonomous fault handling which addresses these goals is developed using only a uniplex hardware arrangement. It operates by observing a health metric to achieve Fault Demotion using Reconfigurable Slack (FaDReS). An autonomous fault isolation scheme is employed which neither requires test vectors nor suspends computational throughput, but instead observes the value of a health metric based on runtime input. The deterministic flow of the fault isolation scheme guarantees success in a bounded number of reconfigurations of the FPGA fabric. FaDReS is then extended to the Priority Using Resource Escalation (PURE) online redundancy scheme, which considers fault-isolation latency and throughput trade-offs under a dynamic spare arrangement. While deep-submicron designs introduce new challenges, the use of adaptive techniques is seen to provide several promising avenues for improving resilience. The schemes developed are demonstrated by hardware designs of various signal processing circuits and their implementation on a Xilinx Virtex-4 FPGA device. These include a Discrete Cosine Transform (DCT) core, a Motion Estimation (ME) engine, a Finite Impulse Response (FIR) filter, a Support Vector Machine (SVM), and Advanced Encryption Standard (AES) blocks, in addition to MCNC benchmark circuits. A significant reduction in power consumption is achieved, ranging from 83% for low-motion-activity scenes to 12.5% for high-motion-activity video scenes in a novel ME engine configuration. For a typical benchmark video sequence, PURE is shown to maintain a PSNR baseline near 32 dB. The diagnosability, reconfiguration latency, and resource overhead of each approach are analyzed. Compared to previous alternatives, PURE maintains a PSNR within 4.02 dB to 6.67 dB of the fault-free baseline by escalating healthy resources to higher-priority signal processing functions. The results indicate the benefits of priority-aware resiliency over conventional redundancy approaches in terms of fault recovery, power consumption, and resource-area requirements. Together, these provide a broad range of strategies for achieving autonomous recovery of reconfigurable logic devices under a variety of constraints, operating conditions, and optimization criteria.
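    The bounded-reconfiguration guarantee can be pictured as a deterministic halving search over suspect fabric regions, as in the hypothetical Python sketch below; reconfigure and health_ok stand in for partial reconfiguration and the runtime health metric, and are assumptions for illustration rather than the FaDReS implementation.

    def isolate_fault(regions, reconfigure, health_ok):
        """Locate one faulty region in at most ceil(log2(n)) reconfigurations."""
        suspects = list(regions)
        while len(suspects) > 1:
            half = suspects[: len(suspects) // 2]
            reconfigure(half)   # relocate functions off `half` onto slack fabric
            if health_ok():     # health metric recovers: fault was in `half`
                suspects = half
            else:               # otherwise it is in the half still in place
                suspects = suspects[len(suspects) // 2 :]
        return suspects[0]

    Because each step halves the suspect set while the circuit keeps running, success is guaranteed within a bounded number of FPGA reconfigurations, mirroring the deterministic flow described above.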

    Cost and benefits design optimization model for fault tolerant flight control systems

    Requirements and specifications for a method of optimizing the design of fault-tolerant flight control systems are provided. Algorithms that could be used for developing new and modifying existing computer programs are also provided, along with recommendations for follow-on work.

    NASA space station automation: AI-based technology review

    Research and development projects in automation for the Space Station are discussed. Artificial Intelligence (AI)-based automation technologies are planned to enhance crew safety by reducing the need for EVA, increase crew productivity by reducing routine operations, increase Space Station autonomy, and augment Space Station capability through the use of teleoperation and robotics. AI technology will also be developed for the servicing of satellites at the Space Station, system monitoring and diagnosis, space manufacturing, and the assembly of large space structures.

    Efficient redundancy selection for processor components to compensate permanent faults

    Steady downscaling of fabrication technologies has driven a rapid increase in the complexity, and thus the computing power, of integrated circuits. This development also raises the requirements on the design and production processes of those systems. In addition, structures in the nanometer regime increase susceptibility to physical effects that can cause temporary and, increasingly, permanent faults. Fault tolerance has become essential for such complex systems and will be even more crucial for future, more vulnerable fabrication technologies. This thesis presents a scalable hardware architecture for compensating permanent faults in arbitrary processor components. The architecture is independent of the fault cause and is therefore suitable for fault compensation directly after production as well as in the field. Being based on active hardware redundancy, it enables gains in reliability, lifetime, and production yield without degrading functionality. The system modeling in this thesis quantifies the efficiency of the presented architecture, taking into account the additional hardware for redundancy and the necessary administrative components, and thereby enables a targeted selection process for processor components and the amount of their redundancy. Consequently, the optimal amount of redundancy for a given system and a given objective can be determined early in the design process and taken into account during implementation. Besides describing the structure and operation of the architecture, this thesis shows how it can be integrated into existing design flows using common methods and tools. The system modeling that realizes the redundancy selection process is described as well. Finally, an application example demonstrates the practicability of the presented approach, and the resulting efficiency and required costs for the chosen example are discussed.
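    A much-simplified reading of the selection process is sketched below in Python: given a per-instance fault-free probability, compute the reliability of a component with r spares and pick the smallest r that meets a target, so the area cost of redundancy can be weighed at design time. The formulas, names, and the fault-independence assumption are illustrative, not the thesis's actual model.

    def component_reliability(p_ok: float, spares: int) -> float:
        """Component survives if at least one of (1 + spares) instances is
        fault-free, assuming independent faults."""
        return 1.0 - (1.0 - p_ok) ** (1 + spares)

    def min_spares_for_target(p_ok: float, target: float, max_spares: int = 8):
        """Smallest spare count meeting the reliability target, or None."""
        for r in range(max_spares + 1):
            if component_reliability(p_ok, r) >= target:
                return r
        return None

    # Example: a fault-prone unit needs two spares to reach 99.9% reliability,
    # while an already-robust one needs none.
    print(min_spares_for_target(p_ok=0.90, target=0.999))   # -> 2
    print(min_spares_for_target(p_ok=0.999, target=0.999))  # -> 0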

    Technology 2000, volume 1

    The purpose of the conference was to increase awareness of existing NASA-developed technologies that are available for immediate use in the development of new products and processes, and to lay the groundwork for the effective utilization of emerging technologies. There were sessions on the following: computer technology and software engineering; human factors engineering and life sciences; information and data management; material sciences; manufacturing and fabrication technology; power, energy, and control systems; robotics; sensors and measurement technology; artificial intelligence; environmental technology; optics and communications; and superconductivity.