1,681 research outputs found

    Special session: Operating systems under test: An overview of the significance of the operating system in the resiliency of the computing continuum

    Get PDF
    The computing continuum's actual trend is facing a growth in terms of devices with any degree of computational capability. Those devices may or may not include a full-stack, including the Operating System layer and the Application layer, or just facing pure bare-metal solutions. In either case, the reliability of the full system stack has to be guaranteed. It is crucial to provide data regarding the impact of faults at all system stack levels and potential hardening solutions to design highly resilient systems. While most of the work usually concentrates on the application reliability, the special session aims to provide a deep comprehension of the impact on the reliability of an embedded system when faults in the hardware substrate of the system stack surface at the Operating System layer. For this reason, we will cover a comparison from an application perspective when hardware faults happen in bare metal vs. real-time OS vs. general-purpose OS. Then we will go deeper within a FreeRTOS to evaluate the contribution of all parts of the OS. Eventually, the Special Session will propose some hardening techniques at the Operating System level by exploiting the scheduling capabilities

    An Optimization Based Design for Integrated Dependable Real-Time Embedded Systems

    Get PDF
    Moving from the traditional federated design paradigm, integration of mixedcriticality software components onto common computing platforms is increasingly being adopted by automotive, avionics and the control industry. This method faces new challenges such as the integration of varied functionalities (dependability, responsiveness, power consumption, etc.) under platform resource constraints and the prevention of error propagation. Based on model driven architecture and platform based design’s principles, we present a systematic mapping process for such integration adhering a transformation based design methodology. Our aim is to convert/transform initial platform independent application specifications into post integration platform specific models. In this paper, a heuristic based resource allocation approach is depicted for the consolidated mapping of safety critical and non-safety critical applications onto a common computing platform meeting particularly dependability/fault-tolerance and real-time requirements. We develop a supporting tool suite for the proposed framework, where VIATRA (VIsual Automated model TRAnsformations) is used as a transformation tool at different design steps. We validate the process and provide experimental results to show the effectiveness, performance and robustness of the approach

    Approximate Computing Strategies for Low-Overhead Fault Tolerance in Safety-Critical Applications

    Get PDF
    This work studies the reliability of embedded systems with approximate computing on software and hardware designs. It presents approximate computing methods and proposes approximate fault tolerance techniques applied to programmable hardware and embedded software to provide reliability at low computational costs. The objective of this thesis is the development of fault tolerance techniques based on approximate computing and proving that approximate computing can be applied to most safety-critical systems. It starts with an experimental analysis of the reliability of embedded systems used at safety-critical projects. Results show that the reliability of single-core systems, and types of errors they are sensitive to, differ from multicore processing systems. The usage of an operating system and two different parallel programming APIs are also evaluated. Fault injection experiment results show that embedded Linux has a critical impact on the system’s reliability and the types of errors to which it is most sensitive. Traditional fault tolerance techniques and parallel variants of them are evaluated for their fault-masking capability on multicore systems. The work shows that parallel fault tolerance can indeed not only improve execution time but also fault-masking. Lastly, an approximate parallel fault tolerance technique is proposed, where the system abandons faulty execution tasks. This first approximate computing approach to fault tolerance in parallel processing systems was able to improve the reliability and the fault-masking capability of the techniques, significantly reducing errors that would cause system crashes. Inspired by the conflict between the improvements provided by approximate computing and the safety-critical systems requirements, this work presents an analysis of the applicability of approximate computing techniques on critical systems. The proposed techniques are tested under simulation, emulation, and laser fault injection experiments. Results show that approximate computing algorithms do have a particular behavior, different from traditional algorithms. The approximation techniques presented and proposed in this work are also used to develop fault tolerance techniques. Results show that those new approximate fault tolerance techniques are less costly than traditional ones and able to achieve almost the same level of error masking.Este trabalho estuda a confiabilidade de sistemas embarcados com computação aproximada em software e projetos de hardware. Ele apresenta métodos de computação aproximada e técnicas aproximadas para tolerância a falhas em hardware programável e software embarcado que provêem alta confiabilidade a baixos custos computacionais. O objetivo desta tese é o desenvolvimento de técnicas de tolerância a falhas baseadas em computação aproximada e provar que este paradigma pode ser usado em sistemas críticos. O texto começa com uma análise da confiabilidade de sistemas embarcados usados em sistemas de tolerância crítica. Os resultados mostram que a resiliência de sistemas singlecore, e os tipos de erros aos quais eles são mais sensíveis, é diferente dos multi-core. O uso de sistemas operacionais também é analisado, assim como duas APIs de programação paralela. Experimentos de injeção de falhas mostram que o uso de Linux embarcado tem um forte impacto na confiabilidade do sistema. Técnicas tradicionais de tolerância a falhas e variações paralelas das mesmas são avaliadas. O trabalho mostra que técnicas de tolerância a falhas paralelas podem de fato melhorar não apenas o tempo de execução da aplicação, mas também seu mascaramento de erros. Por fim, uma técnica de tolerância a falhas paralela aproximada é proposta, onde o sistema abandona instâncias de execuções que apresentam falhas. Esta primeira experiência com computação aproximada foi capaz de melhorar a confiabilidade das técnicas previamente apresentadas, reduzindo significativamente a ocorrência de erros que provocam um crash total do sistema. Inspirado pelo conflito entre as melhorias trazidas pela computação aproximada e os requisitos dos sistemas críticos, este trabalho apresenta uma análise da aplicabilidade de computação aproximada nestes sistemas. As técnicas propostas são testadas sob experimentos de injeção de falhas por simulação, emulação e laser. Os resultados destes experimentos mostram que algoritmos aproximados possuem um comportamento particular que lhes é inerente, diferente dos tradicionais. As técnicas de aproximação apresentadas e propostas no trabalho são também utilizadas para o desenvolvimento de técnicas de tolerância a falhas aproximadas. Estas novas técnicas possuem um custo menor que as tradicionais e são capazes de atingir o mesmo nível de mascaramento de erros

    Software Fault Tolerance in Real-Time Systems: Identifying the Future Research Questions

    Get PDF
    Tolerating hardware faults in modern architectures is becoming a prominent problem due to the miniaturization of the hardware components, their increasing complexity, and the necessity to reduce the costs. Software-Implemented Hardware Fault Tolerance approaches have been developed to improve the system dependability to hardware faults without resorting to custom hardware solutions. However, these come at the expense of making the satisfaction of the timing constraints of the applications/activities harder from a scheduling standpoint. This paper surveys the current state of the art of fault tolerance approaches when used in the context real-time systems, identifying the main challenges and the cross-links between these two topics. We propose a joint scheduling-failure analysis model that highlights the formal interactions among software fault tolerance mechanisms and timing properties. This model allows us to present and discuss many open research questions with the final aim to spur the future research activities

    Advanced information processing system: The Army fault tolerant architecture conceptual study. Volume 1: Army fault tolerant architecture overview

    Get PDF
    Digital computing systems needed for Army programs such as the Computer-Aided Low Altitude Helicopter Flight Program and the Armored Systems Modernization (ASM) vehicles may be characterized by high computational throughput and input/output bandwidth, hard real-time response, high reliability and availability, and maintainability, testability, and producibility requirements. In addition, such a system should be affordable to produce, procure, maintain, and upgrade. To address these needs, the Army Fault Tolerant Architecture (AFTA) is being designed and constructed under a three-year program comprised of a conceptual study, detailed design and fabrication, and demonstration and validation phases. Described here are the results of the conceptual study phase of the AFTA development. Given here is an introduction to the AFTA program, its objectives, and key elements of its technical approach. A format is designed for representing mission requirements in a manner suitable for first order AFTA sizing and analysis, followed by a discussion of the current state of mission requirements acquisition for the targeted Army missions. An overview is given of AFTA's architectural theory of operation

    Cross layer reliability estimation for digital systems

    Get PDF
    Forthcoming manufacturing technologies hold the promise to increase multifuctional computing systems performance and functionality thanks to a remarkable growth of the device integration density. Despite the benefits introduced by this technology improvements, reliability is becoming a key challenge for the semiconductor industry. With transistor size reaching the atomic dimensions, vulnerability to unavoidable fluctuations in the manufacturing process and environmental stress rise dramatically. Failing to meet a reliability requirement may add excessive re-design cost to recover and may have severe consequences on the success of a product. %Worst-case design with large margins to guarantee reliable operation has been employed for long time. However, it is reaching a limit that makes it economically unsustainable due to its performance, area, and power cost. One of the open challenges for future technologies is building ``dependable'' systems on top of unreliable components, which will degrade and even fail during normal lifetime of the chip. Conventional design techniques are highly inefficient. They expend significant amount of energy to tolerate the device unpredictability by adding safety margins to a circuit's operating voltage, clock frequency or charge stored per bit. Unfortunately, the additional cost introduced to compensate unreliability are rapidly becoming unacceptable in today's environment where power consumption is often the limiting factor for integrated circuit performance, and energy efficiency is a top concern. Attention should be payed to tailor techniques to improve the reliability of a system on the basis of its requirements, ending up with cost-effective solutions favoring the success of the product on the market. Cross-layer reliability is one of the most promising approaches to achieve this goal. Cross-layer reliability techniques take into account the interactions between the layers composing a complex system (i.e., technology, hardware and software layers) to implement efficient cross-layer fault mitigation mechanisms. Fault tolerance mechanism are carefully implemented at different layers starting from the technology up to the software layer to carefully optimize the system by exploiting the inner capability of each layer to mask lower level faults. For this purpose, cross-layer reliability design techniques need to be complemented with cross-layer reliability evaluation tools, able to precisely assess the reliability level of a selected design early in the design cycle. Accurate and early reliability estimates would enable the exploration of the system design space and the optimization of multiple constraints such as performance, power consumption, cost and reliability. This Ph.D. thesis is devoted to the development of new methodologies and tools to evaluate and optimize the reliability of complex digital systems during the early design stages. More specifically, techniques addressing hardware accelerators (i.e., FPGAs and GPUs), microprocessors and full systems are discussed. All developed methodologies are presented in conjunction with their application to real-world use cases belonging to different computational domains

    Design of a fault tolerant airborne digital computer. Volume 1: Architecture

    Get PDF
    This volume is concerned with the architecture of a fault tolerant digital computer for an advanced commercial aircraft. All of the computations of the aircraft, including those presently carried out by analogue techniques, are to be carried out in this digital computer. Among the important qualities of the computer are the following: (1) The capacity is to be matched to the aircraft environment. (2) The reliability is to be selectively matched to the criticality and deadline requirements of each of the computations. (3) The system is to be readily expandable. contractible, and (4) The design is to appropriate to post 1975 technology. Three candidate architectures are discussed and assessed in terms of the above qualities. Of the three candidates, a newly conceived architecture, Software Implemented Fault Tolerance (SIFT), provides the best match to the above qualities. In addition SIFT is particularly simple and believable. The other candidates, Bus Checker System (BUCS), also newly conceived in this project, and the Hopkins multiprocessor are potentially more efficient than SIFT in the use of redundancy, but otherwise are not as attractive

    Enhanced Compiler Technology for Software-based Hardware Fault Detection

    Get PDF
    Software-Implemented Hardware Fault Tolerance (SIHFT) is a modern approach for tackling random hardware faults of dependable systems employing solely software solutions. This work extends an automatic compiler-based SIHFT hardening tool called ASPIS, enhancing it with novel protection mechanisms and overhead-reduction techniques, also providing an extensive analysis of its compliance with the non-trivial workload of the open-source Real-Time Operating System FreeRTOS. A thorough experimental fault-injection campaign on an STM32 board shows how the system achieves remarkably high tolerance to single-event upsets and a comparison between the SIHFT mechanisms implemented summarises the trade-off between the overhead introduced and the detection capabilities of the various solutions
    • …
    corecore