
    Advanced information processing system: The Army fault tolerant architecture conceptual study. Volume 1: Army fault tolerant architecture overview

    Digital computing systems needed for Army programs such as the Computer-Aided Low Altitude Helicopter Flight Program and the Armored Systems Modernization (ASM) vehicles are characterized by high computational throughput and input/output bandwidth, hard real-time response, high reliability and availability, and demanding maintainability, testability, and producibility requirements. In addition, such a system should be affordable to produce, procure, maintain, and upgrade. To address these needs, the Army Fault Tolerant Architecture (AFTA) is being designed and constructed under a three-year program comprising conceptual study, detailed design and fabrication, and demonstration and validation phases. This volume presents the results of the conceptual study phase of the AFTA development: an introduction to the AFTA program, its objectives, and the key elements of its technical approach. A format is defined for representing mission requirements in a manner suitable for first-order AFTA sizing and analysis, followed by a discussion of the current state of mission requirements acquisition for the targeted Army missions. An overview is then given of AFTA's architectural theory of operation.

    Constructing fail-controlled nodes for distributed systems: a software approach

    PhD thesis. Designing and implementing distributed systems that continue to provide specified services in the presence of processing-site and communication failures is a difficult task. To facilitate their development, distributed systems have been built assuming that their underlying hardware components are fail-controlled, i.e. present a well-defined failure mode. However, if conventional hardware cannot provide the assumed failure mode, there is a need to build processing sites, or nodes, and a communication infrastructure that present the assumed fail-controlled behaviour. Coupling a number of redundant processors within a replicated node is a well-known way of constructing fail-controlled nodes. Computation is replicated and executed simultaneously at each processor, and by applying suitable validation techniques (e.g. majority voting, comparison) to the outputs generated by the processors, outputs from faulty processors can be prevented from appearing at the application level. One way of constructing replicated nodes is to introduce hardwired mechanisms that couple the replicated processors with specialised validation hardware circuits: processors are tightly synchronised at the clock-cycle level and have their outputs validated by reliable validation hardware. Another approach is to use software mechanisms to perform the synchronisation of processors and the validation of outputs. The main advantage of hardware-based nodes is the minimal performance overhead incurred; however, the special circuits may increase the complexity of the design tremendously, and every new microprocessor architecture requires considerable redesign effort. Software-based nodes do not present these problems; on the other hand, they introduce much larger performance overheads. In this thesis we investigate alternative ways of constructing efficient fail-controlled, software-based replicated nodes. In particular, we present much more efficient order protocols, which are necessary for the implementation of such nodes. Our protocols, unlike others published to date, do not require the processors' physical clocks to be explicitly synchronised. The main contribution of this thesis is the precise definition of the semantics of a software-based fail-silent node, along with its efficient design, implementation, and performance evaluation. This work was supported by the Brazilian National Research Council (CNPq/Brasil).
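    The output-validation step named in the abstract is easy to picture in code. Below is a minimal Python sketch of majority voting over replica outputs; the function and replica names are illustrative, not taken from the thesis.

        from collections import Counter

        def validate(outputs):
            """Majority-vote the values produced by the replicated processors.
            Returns the majority value; if no strict majority exists, the node
            fails silent instead of emitting a possibly wrong output."""
            value, count = Counter(outputs.values()).most_common(1)[0]
            if 2 * count > len(outputs):
                return value
            raise RuntimeError("no majority: node fails silent")

        # Three replicas, one faulty processor: its output is masked.
        print(validate({"p0": 42, "p1": 42, "p2": 17}))   # -> 42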

    Automated Synthesis of Timed and Distributed Fault-Tolerant Systems

    This dissertation concentrates on the problem of automated synthesis and repair of fault-tolerant systems: given the required specification of the system, our goal is to synthesize a fault-tolerant system or repair an existing one. We study this problem for two classes of systems, timed and distributed.

    In the context of timed systems, we focus on efficient synthesis of fault-tolerant timed models from their fault-intolerant versions. Although the complexity of the synthesis problem is known to be polynomial time in the size of the time-abstract bisimulation of the input model, the state of the art lacked synthesis algorithms that can be implemented efficiently. This is in part because synthesis is in general a challenging problem, and its complexity is significantly magnified in the context of timed systems. We propose an algorithm that takes as input a timed automaton, a set of fault actions, and a set of safety and bounded-time response properties, and utilizes a space-efficient symbolic representation of the timed automaton (the zone graph) to synthesize a fault-tolerant timed automaton as output. The output automaton satisfies strict phased recovery: it is guaranteed to behave like the input model in the absence of faults, while in the presence of faults recovery is achieved in two phases, each satisfying certain safety and timing constraints.

    In the context of distributed systems, we study the problem of synthesizing fault-tolerant systems from their intolerant versions when the number of processes is unknown. To synthesize a distributed fault-tolerant protocol that works for systems with any number of processes, we use counter abstraction, which lets us carry out the synthesis on a finite-state abstract model. Applying our proposed algorithm, we successfully synthesized a fault-tolerant distributed agreement protocol in the presence of Byzantine faults. Although the synthesis problem is known to be NP-complete in the state space of the input protocol (due to the partial observability of processes) in the non-parameterized setting, our parameterized algorithm synthesizes a solution for a problem as complex as Byzantine agreement in less than two minutes.

    A system may reach a bad state due to wrong initialization or the occurrence of faults. Self-stabilizing systems are a well-known class of distributed fault-tolerant systems: they converge to their legitimate states starting from any state and, if no fault occurs, stay in legitimate states thereafter. We propose a sound and complete automated method to synthesize self-stabilizing systems starting from the desired topology and type of the system. The method is based on SMT solving, where the desired specification of the system is formulated as SMT constraints; we implemented it using the Alloy solver and successfully synthesized several well-known self-stabilizing algorithms. We also extend the method to support a type of stabilizing algorithm called ideal stabilization, and the case where the set of legitimate states is not explicitly known. Quantitative metrics such as recovery time are crucial when self-stabilizing systems are used in practice (for instance in networking applications). Our synthesis method generates a solution that respects the desired system specification, but it does not take quantitative metrics into account. We therefore study the problem of repairing self-stabilizing systems (where only the removal of transitions is allowed) to satisfy quantitative limitations; the metric under study is average recovery time, which characterizes the performance of stabilizing programs. We show that this repair problem is NP-complete in the state space of the given system.
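    The dissertation's synthesized protocols are not reproduced here, but the convergence property that defines self-stabilization is easy to demonstrate. The Python sketch below simulates Dijkstra's classic K-state token ring, a standard example of a self-stabilizing algorithm (not one of the dissertation's outputs): started from an arbitrary, possibly corrupted state, the ring converges to a legitimate state with exactly one privileged process.

        import random

        def privileged(x):
            """Indices of privileged processes in Dijkstra's K-state ring:
            process 0 when x[0] == x[n-1]; process i > 0 when x[i] != x[i-1].
            At least one process is always privileged."""
            p = [0] if x[0] == x[-1] else []
            return p + [i for i in range(1, len(x)) if x[i] != x[i - 1]]

        def step(x, k):
            """Fire one randomly chosen privileged process."""
            i = random.choice(privileged(x))
            x[i] = (x[i] + 1) % k if i == 0 else x[i - 1]

        # Arbitrary (possibly fault-corrupted) initial state; legitimate
        # states are exactly those with a single privilege. k >= n
        # guarantees convergence.
        x, k = [3, 0, 2, 2, 1], 7
        steps = 0
        while len(privileged(x)) > 1:
            step(x, k)
            steps += 1
        print(f"stabilized after {steps} steps; state {x}")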

    Advanced Launch System Multi-Path Redundant Avionics Architecture Analysis and Characterization

    The objective of the Multi-Path Redundant Avionics Suite (MPRAS) program is the development of a set of avionics architectural modules applicable to the family of launch vehicles required to support the Advanced Launch System (ALS). To enable ALS cost/performance requirements to be met, the MPRAS must support autonomy, maintenance, and testability capabilities exceeding those of conventional launch vehicles. The multi-path redundant, or fault-tolerant, characteristics of the MPRAS are necessary to offset the reduction in avionics reliability caused by the increased complexity needed to support these new cost-reduction and performance capabilities, and to meet avionics reliability requirements that will provide cost-effective reductions in overall ALS recurring costs. A complex, real-time distributed computing system is needed to meet the ALS avionics system requirements. General Dynamics, Boeing Aerospace, and C.S. Draper Laboratory have proposed system architectures as candidates for the ALS MPRAS. This document reports the results of independent performance and reliability characterization and assessment analyses of each proposed candidate architecture, together with qualitative assessments of testability, maintainability, and fault-tolerance mechanisms. These independent analyses were conducted as part of the MPRAS Part 2 program and were carried out under NASA Langley Research Contract NAS1-17964, Task Assignment 28.

    Cross layer reliability estimation for digital systems

    Forthcoming manufacturing technologies promise to increase the performance and functionality of multifunctional computing systems thanks to a remarkable growth in device integration density. Despite the benefits introduced by these technology improvements, reliability is becoming a key challenge for the semiconductor industry. With transistor sizes reaching atomic dimensions, vulnerability to unavoidable fluctuations in the manufacturing process and to environmental stress rises dramatically. Failing to meet a reliability requirement may add excessive re-design cost and may have severe consequences for the success of a product. Worst-case design, with large margins to guarantee reliable operation, has been employed for a long time, but it is reaching a limit that makes it economically unsustainable due to its performance, area, and power cost. One of the open challenges for future technologies is building "dependable" systems on top of unreliable components, which will degrade and even fail during the normal lifetime of the chip. Conventional design techniques are highly inefficient: they expend a significant amount of energy to tolerate device unpredictability by adding safety margins to a circuit's operating voltage, clock frequency, or charge stored per bit. The additional costs introduced to compensate for unreliability are rapidly becoming unacceptable in today's environment, where power consumption is often the limiting factor for integrated-circuit performance and energy efficiency is a top concern. Attention should therefore be paid to tailoring reliability-improvement techniques to a system's actual requirements, ending up with cost-effective solutions that favour the success of the product on the market. Cross-layer reliability is one of the most promising approaches to achieve this goal. Cross-layer reliability techniques take into account the interactions between the layers composing a complex system (i.e., the technology, hardware, and software layers) to implement efficient cross-layer fault-mitigation mechanisms: fault-tolerance mechanisms are implemented at different layers, from the technology up to the software layer, exploiting the inherent capability of each layer to mask lower-level faults. For this purpose, cross-layer reliability design techniques need to be complemented with cross-layer reliability evaluation tools, able to assess precisely the reliability level of a selected design early in the design cycle. Accurate and early reliability estimates would enable exploration of the system design space and the optimization of multiple constraints such as performance, power consumption, cost, and reliability. This Ph.D. thesis is devoted to the development of new methodologies and tools to evaluate and optimize the reliability of complex digital systems during the early design stages. More specifically, techniques addressing hardware accelerators (i.e., FPGAs and GPUs), microprocessors, and full systems are discussed. All the developed methodologies are presented in conjunction with their application to real-world use cases belonging to different computational domains.
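    As a toy illustration of how cross-layer masking composes, the sketch below attenuates a raw device-level fault rate by per-layer masking probabilities, assuming masking at each layer is independent; all layer names and numbers are illustrative assumptions, not figures from the thesis.

        # Only faults unmasked at every layer become system-level failures.
        raw_fault_rate = 1e-4            # device-level faults per hour (assumed)

        masking = {                      # P(fault masked at this layer), assumed
            "technology": 0.20,          # e.g. electrical masking
            "hardware":   0.60,          # e.g. logic masking, ECC
            "software":   0.50,          # e.g. dead values, retries
        }

        rate = raw_fault_rate
        for layer, p_mask in masking.items():
            rate *= 1.0 - p_mask         # unmasked fraction propagates upward
            print(f"after {layer:>10} layer: {rate:.3e} failures/h")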

    Mixed-Criticality Systems on Commercial-Off-the-Shelf Multi-Processor Systems-on-Chip

    Avionics and space industries are struggling with the adoption of technologies like multi-processor systems-on-chip (MPSoCs) due to strict safety requirements. This thesis proposes a new reference architecture for MPSoC-based mixed-criticality systems (MCS), i.e., systems integrating applications with different levels of criticality, which are a common use case for the aforementioned industries. The proposed system architecture is capable of guaranteeing partitioning, in short the property of fault containment. It is based on the detection of spatial and temporal interference, and has been named the online detection of interference (ODIn) architecture.

    Spatial partitioning requires that an application not be able to corrupt resources used by a different application. In the proposed architecture, spatial partitioning is implemented using type-1 hypervisors, which allow the definition of resource partitions: an application running in a partition can only access resources granted to that partition, so it cannot corrupt resources used by applications running in other partitions.

    Temporal partitioning requires that an application not be able to unexpectedly change the execution time of other applications. In the proposed architecture, temporal partitioning is achieved with a bounded-interference approach, composed of an offline analysis phase and an online safety net. The offline phase is based on statistical profiling, in nominal conditions, of a metric sensitive to temporal interference, which allows the definition of three thresholds:

    1. the detection threshold TD;
    2. the warning threshold TW;
    3. the α threshold.

    Two detection rules are defined using these thresholds:

    • Alarm rule: the value of the metric is above TD.
    • Warning rule: the value of the metric stays in the warning region [TW, TD] for more than α consecutive samples.

    ODIn's online safety net exploits the performance counters available in many MPSoC architectures; the counters are configured at bootstrap to monitor the selected metric(s) and to raise an interrupt request (IRQ) when the metric value goes above TD, implementing the alarm rule. The warning rule is implemented in a software detection module, which reads the performance counters when the monitored task yields control to the scheduler and resets them if there is no detection. ODIn also uses two additional detection mechanisms:

    1. a control-flow-check technique, based on block signatures defined at compile time, implemented through a set of watchdog processors (WDPs), each monitoring one partition;
    2. a timeout, implemented through a system watchdog timer (SWDT), which sends an external signal when the timeout is violated.

    The recovery actions implemented in ODIn are:

    • graceful degradation, reacting to IRQs from WDPs monitoring non-critical applications or to warning-rule violations: it temporarily stops non-critical applications to grant resources to the critical application;
    • hard recovery, reacting to the SWDT, to the WDP of the critical application, or to alarm-rule violations: it causes a switch to a hot stand-by spare computer.

    Experimental validation of ODIn was performed on two hardware platforms: the dual-core ZedBoard and the quad-core Inventami board. A space benchmark and an avionics benchmark were implemented on both platforms, composed of different modules as shown in Table 1. Each version of the final application was evaluated through fault injection (FI) campaigns, performed using a specifically designed FI system. Three types of FI campaigns were run:

    1. HW FI, to emulate single-event effects;
    2. SW FI, to emulate bugs in non-critical applications;
    3. artificial-bug FI, to emulate a bug in non-critical applications introducing unexpected interference on the critical application.

    Experimental results show that ODIn is resilient to all the considered types of faults.
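    The two detection rules lend themselves to a compact sketch. The Python below implements the alarm and warning rules as stated above, assuming the metric is sampled each time the monitored task yields to the scheduler; the threshold values are placeholders, not ODIn's calibrated ones.

        T_D, T_W, ALPHA = 1000, 800, 3   # placeholder thresholds

        def make_detector(t_d, t_w, alpha):
            consecutive = 0              # samples spent in the warning region
            def check(metric):
                nonlocal consecutive
                if metric > t_d:         # alarm rule: metric above T_D
                    return "alarm"
                if metric >= t_w:        # in [T_W, T_D], since metric <= T_D here
                    consecutive += 1
                    if consecutive > alpha:   # warning rule
                        return "warning"
                else:
                    consecutive = 0      # left the region: reset the counter
                return "ok"
            return check

        check = make_detector(T_D, T_W, ALPHA)
        for sample in (500, 850, 900, 870, 880, 1200):
            print(sample, "->", check(sample))   # ends in warning, then alarm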

    Dependability modeling and evaluation – From AADL to stochastic Petri nets

    Performing dependability evaluation along with other analyses at the architectural level makes it possible both to predict the effects of architectural decisions on the dependability of a system and to make trade-offs. Thus, both industry and academia focus on defining model-driven engineering (MDE) approaches and on integrating several analyses into the development process. AADL (Architecture Analysis and Design Language) has proved to be efficient for architectural modeling and is considered by industry in this context. Our contribution is a modeling framework allowing the generation of dependability-oriented analytical models from AADL models, to facilitate the evaluation of dependability measures such as reliability and availability. We propose an iterative approach for system dependability modeling using AADL, and we provide a set of reusable modeling patterns for fault-tolerant architectures. The AADL dependability model is transformed into a GSPN (Generalized Stochastic Petri Net) by applying model transformation rules; we have implemented an automatic model transformation tool. The resulting GSPN can be processed by existing tools to obtain the dependability measures. The approach is illustrated on a subsystem of the French Air Traffic Control System.
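    The last step, solving the generated GSPN for dependability measures, can be illustrated with the simplest possible case: a single component with exponential failure and repair rates, whose net reduces to a two-state Markov chain with steady-state availability A = μ/(λ+μ) = MTTF/(MTTF+MTTR). The rates below are illustrative, not taken from the paper.

        lam = 1e-4    # failure rate [1/h]  (MTTF = 10,000 h), illustrative
        mu  = 0.5     # repair  rate [1/h]  (MTTR = 2 h), illustrative

        availability = mu / (lam + mu)               # A = MTTF / (MTTF + MTTR)
        print(f"steady-state availability: {availability:.6f}")
        print(f"expected downtime: {(1 - availability) * 8760:.2f} h/year")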

    Revised reference model

    This document contains an update of the HIDENETS Reference Model, whose preliminary version was introduced in D1.1. The Reference Model describes the overall approach to the development and assessment of end-to-end resilience solutions; thanks to its abstraction level, the framework is not restricted to the HIDENETS car-to-car and car-to-infrastructure applications and use cases. Starting from a condensed summary of the dependability terminology used, the document presents the network architecture, comprising the ad hoc and infrastructure domains and the definition of the main networking elements, together with the software architecture of the mobile nodes. The concept of architectural hybridization and its inclusion in HIDENETS-like dependability solutions is described next, followed by a set of communication- and middleware-level services that follow the architectural hybridization concept and are motivated by the dependability and resilience challenges raised by HIDENETS-like scenarios. Besides architectural solutions, the Reference Model addresses the assessment of dependability solutions in HIDENETS-like scenarios using quantitative evaluation, realized by a combination of top-down and bottom-up modelling, as well as verification via test scenarios. To allow for fault prevention in the software development phase of HIDENETS-like applications, generic UML-based modelling approaches focusing on dependability-related aspects are described. The HIDENETS Reference Model thus provides the framework in which the detailed solutions of the HIDENETS project are being developed, while facilitating the same task for non-vehicular scenarios and applications.

    Low-Power and Error-Resilient VLSI Circuits and Systems.

    Efficient low-power operation is critically important for the success of next-generation signal processing applications. Device dimensions and supply voltages have been continuously scaled to meet ever tighter power envelopes, but scaling has created resiliency challenges, including increasing timing faults and soft errors. Our research aims at designing low-power and robust circuits and systems for signal processing by drawing on circuit, architecture, and algorithm approaches. To gain insight into the system faults caused by supply-voltage reduction, we investigated the two primary effects that determine the minimum supply voltage (VMIN) in Intel's tri-gate CMOS technology, namely process variations and gate-dielectric soft breakdown. We determined that voltage scaling widens the timing window in which sequential circuits are vulnerable, and we therefore proposed a new hold-time violation metric to define hold-time VMIN, which has been adopted as a new design standard. Device scaling also increases soft errors, which affect circuit reliability. Through extensive soft-error characterization using two 65nm CMOS test chips, we studied the soft-error mechanisms and their dependence on supply voltage and clock frequency. This study laid the foundation of the first 65nm DSP chip design for a NASA spaceflight project. To mitigate such random errors, we proposed a new confidence-driven architecture that effectively enhances the error resiliency of deeply scaled CMOS and post-CMOS circuits. Designing low-power resilient systems can also leverage application-specific algorithmic approaches. To explore design opportunities in the algorithmic domain, we demonstrate an application-specific detection and decoding processor for multiple-input multiple-output (MIMO) wireless communication. To reduce the receive error rate for robust wireless communication, we designed a joint detection and decoding technique that encloses detection and decoding in an iterative loop, enhancing both interference cancellation and error reduction. A proof-of-concept chip was fabricated for next-generation 4x4 256QAM MIMO systems. Through algorithm-architecture optimization and low-power circuit techniques, our design achieves significant improvements in throughput, energy efficiency, and error rate, paving the way for future developments in this area. PhD thesis, Electrical Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/110323/1/uchchen_1.pd
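    As a rough illustration of the hold-time VMIN idea, the sketch below computes a textbook hold slack (clock-to-Q plus minimum combinational delay, minus the hold requirement and clock skew), degrades it by a variation term assumed to grow as the supply voltage drops, and searches for the lowest voltage at which the worst-case slack stays non-negative. The delay and variation models and all numbers are invented for illustration; they are not the thesis's metric or characterization data.

        def worst_hold_slack(vdd):
            """Hold slack = t_cq + t_comb(min) - (t_hold + t_skew), minus a
            3-sigma variation penalty that (by assumption) grows as Vdd drops."""
            t_cq, t_comb, t_hold, t_skew = 40.0, 25.0, 30.0, 15.0   # ps, invented
            sigma = 5.0 * (0.8 / vdd) ** 3                          # invented model
            return (t_cq + t_comb) - (t_hold + t_skew) - 3.0 * sigma

        # Sweep downward from nominal Vdd; stop at the lowest voltage whose
        # worst-case hold slack is still non-negative.
        vdd = 1.0
        while worst_hold_slack(vdd - 0.01) >= 0.0:
            vdd -= 0.01
        print(f"hold-time V_MIN ~= {vdd:.2f} V")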