2,306 research outputs found

    Advanced information processing system: The Army fault tolerant architecture conceptual study. Volume 2: Army fault tolerant architecture design and analysis

    Get PDF
    Described here is the Army Fault Tolerant Architecture (AFTA) hardware architecture and components and the operating system. The architectural and operational theory of the AFTA Fault Tolerant Data Bus is discussed. The test and maintenance strategy developed for use in fielded AFTA installations is presented. An approach to be used in reducing the probability of AFTA failure due to common mode faults is described. Analytical models for AFTA performance, reliability, availability, life cycle cost, weight, power, and volume are developed. An approach is presented for using VHSIC Hardware Description Language (VHDL) to describe and design AFTA's developmental hardware. A plan is described for verifying and validating key AFTA concepts during the Dem/Val phase. Analytical models and partial mission requirements are used to generate AFTA configurations for the TF/TA/NOE and Ground Vehicle missions

    Reliable massively parallel symbolic computing : fault tolerance for a distributed Haskell

    Get PDF
    As the number of cores in manycore systems grows exponentially, the number of failures is also predicted to grow exponentially. Hence massively parallel computations must be able to tolerate faults. Moreover new approaches to language design and system architecture are needed to address the resilience of massively parallel heterogeneous architectures. Symbolic computation has underpinned key advances in Mathematics and Computer Science, for example in number theory, cryptography, and coding theory. Computer algebra software systems facilitate symbolic mathematics. Developing these at scale has its own distinctive set of challenges, as symbolic algorithms tend to employ complex irregular data and control structures. SymGridParII is a middleware for parallel symbolic computing on massively parallel High Performance Computing platforms. A key element of SymGridParII is a domain specific language (DSL) called Haskell Distributed Parallel Haskell (HdpH). It is explicitly designed for scalable distributed-memory parallelism, and employs work stealing to load balance dynamically generated irregular task sizes. To investigate providing scalable fault tolerant symbolic computation we design, implement and evaluate a reliable version of HdpH, HdpH-RS. Its reliable scheduler detects and handles faults, using task replication as a key recovery strategy. The scheduler supports load balancing with a fault tolerant work stealing protocol. The reliable scheduler is invoked with two fault tolerance primitives for implicit and explicit work placement, and 10 fault tolerant parallel skeletons that encapsulate common parallel programming patterns. The user is oblivious to many failures, they are instead handled by the scheduler. An operational semantics describes small-step reductions on states. A simple abstract machine for scheduling transitions and task evaluation is presented. It defines the semantics of supervised futures, and the transition rules for recovering tasks in the presence of failure. The transition rules are demonstrated with a fault-free execution, and three executions that recover from faults. The fault tolerant work stealing has been abstracted in to a Promela model. The SPIN model checker is used to exhaustively search the intersection of states in this automaton to validate a key resiliency property of the protocol. It asserts that an initially empty supervised future on the supervisor node will eventually be full in the presence of all possible combinations of failures. The performance of HdpH-RS is measured using five benchmarks. Supervised scheduling achieves a speedup of 757 with explicit task placement and 340 with lazy work stealing when executing Summatory Liouville up to 1400 cores of a HPC architecture. Moreover, supervision overheads are consistently low scaling up to 1400 cores. Low recovery overheads are observed in the presence of frequent failure when lazy on-demand work stealing is used. A Chaos Monkey mechanism has been developed for stress testing resiliency with random failure combinations. All unit tests pass in the presence of random failure, terminating with the expected results

    EOS: A project to investigate the design and construction of real-time distributed embedded operating systems

    Get PDF
    The EOS project is investigating the design and construction of a family of real-time distributed embedded operating systems for reliable, distributed aerospace applications. Using the real-time programming techniques developed in co-operation with NASA in earlier research, the project staff is building a kernel for a multiple processor networked system. The first six months of the grant included a study of scheduling in an object-oriented system, the design philosophy of the kernel, and the architectural overview of the operating system. In this report, the operating system and kernel concepts are described. An environment for the experiments has been built and several of the key concepts of the system have been prototyped. The kernel and operating system is intended to support future experimental studies in multiprocessing, load-balancing, routing, software fault-tolerance, distributed data base design, and real-time processing

    Dependability assessment of by-wire control systems using fault injection

    Full text link
    This paper is focused on the validation by means of physical fault injection at pin-level of a time-triggered communication controller: the TTP/C versions C1 and C2. The controller is a commercial off-the-shelf product used in the design of by-wire systems. Drive-by-wire and fly-by-wire active safety controls aim to prevent accidents. They are considered to be of critical importance because a serious situation may directly affect user safety. Therefore, dependability assessment is vital in their design. This work was funded by the European project `Fault Injection for TTA¿ and it is divided into two parts. In the first part, there is a verification of the dependability specifications of the TTP communication protocol, based on TTA, in the presence of faults directly induced in communication lines. The second part contains a validation and improvement proposal for the architecture in case of data errors. Such errors are due to faults that occurred during writing (or reading) actions on memory or during data storage.Blanc Clavero, S.; Bonastre Pina, AM.; Gil, P. (2009). Dependability assessment of by-wire control systems using fault injection. Journal of Systems Architecture. 55(2):102-113. doi:10.1016/j.sysarc.2008.09.003S10211355

    Application Agreement and Integration Services

    Get PDF
    Application agreement and integration services are required by distributed, fault-tolerant, safety critical systems to assure required performance. An analysis of distributed and hierarchical agreement strategies are developed against the backdrop of observed agreement failures in fielded systems. The documented work was performed under NASA Task Order NNL10AB32T, Validation And Verification of Safety-Critical Integrated Distributed Systems Area 2. This document is intended to satisfy the requirements for deliverable 5.2.11 under Task 4.2.2.3. This report discusses the challenges of maintaining application agreement and integration services. A literature search is presented that documents previous work in the area of replica determinism. Sources of non-deterministic behavior are identified and examples are presented where system level agreement failed to be achieved. We then explore how TTEthernet services can be extended to supply some interesting application agreement frameworks. This document assumes that the reader is familiar with the TTEthernet protocol. The reader is advised to read the TTEthernet protocol standard [1] before reading this document. This document does not re-iterate the content of the standard

    Perfomance Analysis and Resource Optimisation of Critical Systems Modelled by Petri Nets

    Get PDF
    Un sistema crítico debe cumplir con su misión a pesar de la presencia de problemas de seguridad. Este tipo de sistemas se suele desplegar en entornos heterogéneos, donde pueden ser objeto de intentos de intrusión, robo de información confidencial u otro tipo de ataques. Los sistemas, en general, tienen que ser rediseñados después de que ocurra un incidente de seguridad, lo que puede conducir a consecuencias graves, como el enorme costo de reimplementar o reprogramar todo el sistema, así como las posibles pérdidas económicas. Así, la seguridad ha de ser concebida como una parte integral del desarrollo de sistemas y como una necesidad singular de lo que el sistema debe realizar (es decir, un requisito no funcional del sistema). Así pues, al diseñar sistemas críticos es fundamental estudiar los ataques que se pueden producir y planificar cómo reaccionar frente a ellos, con el fin de mantener el cumplimiento de requerimientos funcionales y no funcionales del sistema. A pesar de que los problemas de seguridad se consideren, también es necesario tener en cuenta los costes incurridos para garantizar un determinado nivel de seguridad en sistemas críticos. De hecho, los costes de seguridad puede ser un factor muy relevante ya que puede abarcar diferentes dimensiones, como el presupuesto, el rendimiento y la fiabilidad. Muchos de estos sistemas críticos que incorporan técnicas de tolerancia a fallos (sistemas FT) para hacer frente a las cuestiones de seguridad son sistemas complejos, que utilizan recursos que pueden estar comprometidos (es decir, pueden fallar) por la activación de los fallos y/o errores provocados por posibles ataques. Estos sistemas pueden ser modelados como sistemas de eventos discretos donde los recursos son compartidos, también llamados sistemas de asignación de recursos. Esta tesis se centra en los sistemas FT con recursos compartidos modelados mediante redes de Petri (Petri nets, PN). Estos sistemas son generalmente tan grandes que el cálculo exacto de su rendimiento se convierte en una tarea de cálculo muy compleja, debido al problema de la explosión del espacio de estados. Como resultado de ello, una tarea que requiere una exploración exhaustiva en el espacio de estados es incomputable (en un plazo prudencial) para sistemas grandes. Las principales aportaciones de esta tesis son tres. Primero, se ofrecen diferentes modelos, usando el Lenguaje Unificado de Modelado (Unified Modelling Language, UML) y las redes de Petri, que ayudan a incorporar las cuestiones de seguridad y tolerancia a fallos en primer plano durante la fase de diseño de los sistemas, permitiendo así, por ejemplo, el análisis del compromiso entre seguridad y rendimiento. En segundo lugar, se proporcionan varios algoritmos para calcular el rendimiento (también bajo condiciones de fallo) mediante el cálculo de cotas de rendimiento superiores, evitando así el problema de la explosión del espacio de estados. Por último, se proporcionan algoritmos para calcular cómo compensar la degradación de rendimiento que se produce ante una situación inesperada en un sistema con tolerancia a fallos
    corecore