33 research outputs found

    CONCURRENT DIAGNOSTICS IN MULTIPROCESSOR SYSTEMS

    The paper presents a survey of diagnostic methods for multiprocessor systems. The diagnostic means known so far are first summarized and evaluated from the point of view of their applicability to systems with distributed control, and specifically to multiprocessor systems. A combination of different diagnostic means is then suggested in order to achieve the maximum diagnostic coverage with minimum overhead.

    Compiler-Assisted Signature Monitoring

    Coordinated Science Laboratory was formerly known as Control Systems Laboratory. Joint Services Electronics Program / N00014-84-C-0149; Office of Naval Research / N00014-88-K-0656; National Science Foundation / MIP-8809478; NCR; National Aeronautics and Space Administration / NASA NAG 1-61

    The Watchdog Task: Concurrent error detection using assertions

    The Watchdog Task, a software abstraction of the Watchdog-processor, is shown to be a powerful error detection tool with a great deal of flexibility and the advantages of watchdog techniques. A Watchdog Task system in Ada is presented; issues of recovery, latency, efficiency (communication) and preprocessing are discussed. Different applications, one of which is error detection on a single processor, are examined.
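
    The idea of a watchdog task that checks assertions concurrently with the monitored program can be sketched roughly as follows. This is an illustrative Python model, not the Ada system from the paper: the `WatchdogTask` class, the `report` and `stop` methods, the `partial_sum` checkpoint, and the `monitored_sum` function are all hypothetical names chosen for the example.

```python
import queue
import threading
from typing import Callable, Dict, List, Tuple


class WatchdogTask:
    """Hypothetical watchdog: checks assertions on values reported by a monitored task."""

    def __init__(self, assertions: Dict[str, Callable[[object], bool]]):
        self._assertions = assertions
        self._inbox: "queue.Queue" = queue.Queue()
        self.errors: List[Tuple[str, object]] = []
        # The watchdog runs in its own thread, concurrently with the monitored code.
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def report(self, checkpoint: str, value: object) -> None:
        """Called by the monitored task; only enqueues, so the caller is not blocked."""
        self._inbox.put((checkpoint, value))

    def stop(self) -> None:
        """Send a shutdown sentinel and wait until all queued reports are checked."""
        self._inbox.put(None)
        self._thread.join()

    def _run(self) -> None:
        while True:
            message = self._inbox.get()
            if message is None:
                break
            checkpoint, value = message
            check = self._assertions.get(checkpoint)
            # An unknown checkpoint or a failed assertion is recorded as an error.
            if check is None or not check(value):
                self.errors.append((checkpoint, value))


def monitored_sum(values: List[int], watchdog: WatchdogTask) -> int:
    """Example monitored task: reports each partial sum to the watchdog."""
    total = 0
    for v in values:
        total += v
        watchdog.report("partial_sum", total)
    return total
```

    Because reports are only enqueued, the cost to the monitored task is one message per checkpoint, which loosely mirrors the communication-efficiency concern the abstract mentions; detection latency then depends on how quickly the watchdog drains its queue.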

    E-QED: Electrical Bug Localization During Post-Silicon Validation Enabled by Quick Error Detection and Formal Methods

    During post-silicon validation, manufactured integrated circuits are extensively tested in actual system environments to detect design bugs. Bug localization involves identification of a bug trace (a sequence of inputs that activates and detects the bug) and a hardware design block where the bug is located. Existing bug localization practices during post-silicon validation are mostly manual and ad hoc, and, hence, extremely expensive and time consuming. This is particularly true for subtle electrical bugs caused by unexpected interactions between a design and its electrical state. We present E-QED, a new approach that automatically localizes electrical bugs during post-silicon validation. Our results on the OpenSPARC T2, an open-source 500-million-transistor multicore chip design, demonstrate the effectiveness and practicality of E-QED: starting with a failed post-silicon test, in a few hours (9 hours on average) we can automatically narrow the location of the bug to (the fan-in logic cone of) a handful of candidate flip-flops (18 flip-flops on average for a design with ~1 million flip-flops) and also obtain the corresponding bug trace. The area impact of E-QED is ~2.5%. In contrast, determining this same information might take weeks (or even months) of mostly manual work using traditional approaches.

    Method and apparatus for fault tolerance

    A method and apparatus for achieving fault tolerance in a computer system having at least a first central processing unit and a second central processing unit. The method comprises the steps of first executing a first algorithm in the first central processing unit on input which produces a first output as well as a certification trail. Next, executing a second algorithm in the second central processing unit on the input and on at least a portion of the certification trail which produces a second output. The second algorithm has a faster execution time than the first algorithm for a given input. Then, comparing the first and second outputs such that an error result is produced if the first and second outputs are not the same. The step of executing a first algorithm and the step of executing a second algorithm preferably take place over essentially the same time period.
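
    The certification-trail scheme described above can be illustrated with sorting, a commonly used example for this technique. The sketch below is an assumption-laden illustration rather than the claimed apparatus: the function names are hypothetical, the trail is taken to be the sorting permutation, and the two algorithms run sequentially in one process instead of on two central processing units over the same time period.

```python
from typing import List, Optional, Sequence, Tuple


def sort_with_trail(data: Sequence[int]) -> Tuple[List[int], List[int]]:
    """First algorithm: O(n log n) sort that also emits a certification trail.

    The trail is the permutation of input indices that puts the data in order.
    """
    trail = sorted(range(len(data)), key=lambda i: data[i])
    output = [data[i] for i in trail]
    return output, trail


def sort_using_trail(data: Sequence[int], trail: Sequence[int]) -> List[int]:
    """Second algorithm: O(n) pass that uses the trail but does not trust it.

    Raises ValueError if the trail is not a permutation of the input indices
    or does not produce a non-decreasing sequence.
    """
    n = len(data)
    if len(trail) != n:
        raise ValueError("trail has the wrong length")
    seen = [False] * n
    output: List[int] = []
    for idx in trail:
        if not (0 <= idx < n) or seen[idx]:
            raise ValueError("trail is not a permutation")
        seen[idx] = True
        if output and data[idx] < output[-1]:
            raise ValueError("trail does not yield sorted order")
        output.append(data[idx])
    return output


def fault_tolerant_sort(data: Sequence[int]) -> Tuple[Optional[List[int]], bool]:
    """Run both algorithms and compare; returns (output, error_flag)."""
    first, trail = sort_with_trail(data)
    try:
        second = sort_using_trail(data, trail)
    except ValueError:
        return None, True
    error = first != second
    return (None if error else first), error
```

    The second algorithm is faster than the first for a given input because the trail spares it the comparison sort, yet it still validates the trail, so a corrupted trail leads to an error result rather than a wrong output.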

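    Méthode de conception d'architectures tolérantes aux fautes transitoires en milieu spatial : Application à une chaîne de compression d'images [Design method for architectures tolerant to transient faults in the space environment: application to an image compression chain]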

    This paper presents an algorithm-architecture matching approach for embedded systems in the space environment. Fault tolerance is addressed at both the algorithmic and the architectural level, built around commercial off-the-shelf (COTS) components. We present the context of the study and an original approach to error detection by signature analysis that exploits the deterministic characteristics of signal and image processing (TDSI) algorithms.

    Software implemented fault tolerance for microprocessor controllers: fault tolerance for microprocessor controllers

    It is generally accepted that transient faults are a major cause of failure in microprocessor systems. Industrial controllers with embedded microprocessors are particularly at risk from this type of failure because their working environments are prone to transient disturbances which can generate transient faults. In order to improve the reliability of processor systems for industrial applications within a limited budget, fault tolerant techniques for uniprocessors are implemented. These techniques aim to identify characteristics of processor operation which are attributed to erroneous behaviour. Once detection is achieved, a programme of restoration activity can be initiated. This thesis initially develops a previous model of erroneous microprocessor behaviour from which characteristics particular to mal-operation are identified. A new technique is proposed, based on software implemented fault tolerance which, by recognizing a particular behavioural characteristic, facilitates the self-detection of erroneous execution. The technique involves inserting detection mechanisms into the target software. This can be quite a complex process and so a prototype software tool called Post-programming Automated Recovery UTility (PARUT) is developed to automate the technique's application. The utility can be used to apply the proposed behavioural fault tolerant technique for a selection of target processors. Fault injection and emulation experiments assess the effectiveness of the proposed fault tolerant technique for three application programs implemented on 8-, 16-, and 32-bit processors, respectively. The modified application programs are shown to have an improved detection capability and hence reliability when the proposed fault tolerant technique is applied. General assessment of the technique cannot be made, however, because its effectiveness is application specific.
    The thesis concludes by considering methods of generating non-hazardous application programs at the compilation stage, and design features for incorporation into the architecture of a microprocessor which inherently reduce the hazard, and increase the detection capability of the target software. Particular suggestions are made to add a 'PARUT' phase to the translation process, and to orientate microprocessor design towards the instruction opcode map.

    Real-time trace decoding and monitoring for safety and security in embedded systems

    Get PDF
    Integrated circuits and systems can be found almost everywhere in today’s world. As their use increases, they need to be made safer and more performant to meet current demands in processing power. FPGA-integrated SoCs can provide the ideal trade-off between performance, adaptability, and energy usage. One of today’s vital challenges lies in updating existing fault tolerance techniques for these new systems while utilizing all available processing capabilities, such as multi-core and heterogeneous processing units. Control-flow monitoring is one of the primary mechanisms described for error detection at the software architectural level for the highest grade of hazard level classifications (e.g., ASIL D) described in industry safety standards such as ISO-26262. Control-flow errors are also known to compose the majority of detected errors for ICs and embedded systems in safety-critical and risk-susceptible environments [5]. Software-based monitoring methods remain the most popular [6–8]. However, recent studies show that the overheads they impose make actual reliability gains negligible [9, 10]. This work proposes and demonstrates a new control-flow checking method implemented in FPGA for multi-core embedded systems called control-flow trace checker (CFTC). CFTC uses existing trace and debug subsystems of modern processors to rebuild their execution states. It can identify any errors in real time by comparing executed states to a set of permitted state transitions determined statically. This novel implementation weighs hardware resource trade-offs to target multiple independent tasks in multi-core embedded applications, as well as single-core systems. The proposed system is entirely implemented in hardware and isolated from all monitored software components, requiring 2.4% of the target FPGA platform resources to protect an execution unit in its entirety. Therefore, it avoids undesired overheads and maintains deterministic error detection latencies, which guarantees reliability improvements without impairing the target software system. Finally, CFTC is evaluated under different software fault-injection scenarios, achieving detection rates of 100% of all control-flow errors to wrong destinations and 98% of all injected faults to program binaries. All detection times are further analyzed and precisely described by a model based on the monitor’s resources and speed and the software application’s control-flow structure and binary characteristics.
    Resumo (translated from Portuguese): Integrated circuits are present in almost all complex systems of the modern world. As their frequency of use increases, they need to become safer and more performant to meet new demands in processing power. Systems-on-chip integrated with FPGAs can provide the ideal balance between performance, adaptability, and energy usage. One of the greatest challenges now is the need to update fault tolerance techniques for these new systems, taking advantage of new advances in processing capacity. Control-flow monitoring is one of the main mechanisms for error detection at the software level for systems classified as high-risk (e.g., ASIL D), described in safety standards such as ISO-26262. These errors are known to make up the majority of errors detected in integrated systems [5]. Although software-based monitoring methods remain the most popular [6–8], recent studies show that their additional costs, in terms of performance and area, considerably diminish their real gains in reliability [9, 10]. We propose here a new control-flow monitoring method implemented in FPGA for multi-core embedded systems. This method uses trace and code execution subsystems to reconstruct the current state of the processor, identifying errors through comparisons between different CPU execution states. We propose an implementation that considers trade-offs in the use of system resources to monitor multiple independent tasks. Our approach supports the monitoring of simple systems as well as multitasking multi-core systems. Finally, our technique is implemented entirely in hardware, avoiding the use of software processing units that could add undesirable costs to the application in lost reliability. We thus propose a scalable and extensible control-flow checking mechanism for protecting critical, multi-core embedded systems.
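
    The core check in this kind of control-flow monitoring, comparing each executed transition against a statically determined set of permitted transitions, can be modeled in a few lines. The following is only a software sketch of the principle; the actual CFTC is a hardware monitor that decodes processor trace streams, and the `first_illegal_transition` function, the `ALLOWED` table, and the block addresses are hypothetical.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

# Hypothetical static control-flow graph: basic-block address -> permitted successors.
ALLOWED: Dict[int, Set[int]] = {
    0x100: {0x110, 0x140},  # conditional branch with two legal targets
    0x110: {0x180},
    0x140: {0x180},
    0x180: set(),           # exit block, no successors
}


def first_illegal_transition(
    trace: Iterable[int], allowed: Dict[int, Set[int]]
) -> Optional[Tuple[int, int, int]]:
    """Return (position, source, destination) of the first transition that is
    not in the permitted set, or None if the whole trace is legal."""
    it = iter(trace)
    try:
        prev = next(it)
    except StopIteration:
        return None  # an empty trace contains no transitions
    for pos, cur in enumerate(it, start=1):
        # A source block missing from the table has no permitted successors.
        if cur not in allowed.get(prev, set()):
            return (pos, prev, cur)
        prev = cur
    return None
```

    For instance, the trace `[0x100, 0x110, 0x180]` passes, while `[0x100, 0x180]` is flagged at position 1 because `0x180` is not a permitted successor of `0x100`. Each step is a single set lookup, which gives a rough intuition for why a checker of this kind can have deterministic detection latency.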