
    PTM-based hybrid error-detection architecture for ARM microprocessors

    This work presents a hybrid error-detection architecture that uses the ARM PTM trace interface to observe ARM microprocessor behaviour. The proposed approach is suitable for COTS microprocessors because it does not modify the microprocessor architecture; it detects errors by reusing the existing trace subsystem. Validation has been performed through proton irradiation and fault-injection campaigns on a Zynq All Programmable SoC, which includes a Cortex-A9 ARM microprocessor, with the proposed hardware monitor implemented in the programmable logic. Experimental results demonstrate that a high error-detection rate can be achieved on a commercial microprocessor.
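
    As an illustration of the checking principle, the sketch below validates (source, destination) branch pairs, as reconstructed from the PTM stream, against a table of legal branches extracted offline from the program binary. It is a minimal sketch in C under assumed names, table contents, and addresses; the paper's actual monitor performs this check in programmable logic.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef struct {
    uint32_t src;  /* address of the branch instruction */
    uint32_t dst;  /* legal branch target               */
} edge_t;

/* Hypothetical table, extracted offline from the program binary. */
static const edge_t legal_edges[] = {
    { 0x00008000u, 0x00008040u },
    { 0x00008038u, 0x00008000u },
};

static bool edge_is_legal(uint32_t src, uint32_t dst)
{
    for (size_t i = 0; i < sizeof legal_edges / sizeof legal_edges[0]; i++)
        if (legal_edges[i].src == src && legal_edges[i].dst == dst)
            return true;
    return false;
}

/* Called for every branch recovered from the trace stream; in the
 * paper this check runs in FPGA logic, not in software. */
void on_traced_branch(uint32_t src, uint32_t dst)
{
    if (!edge_is_legal(src, dst))
        fprintf(stderr, "control-flow error: %08x -> %08x\n",
                (unsigned)src, (unsigned)dst);
}
```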

    Online error detection through trace infrastructure in ARM microprocessors

    This paper presents a solution for error detection in ARM microprocessors based on the trace infrastructure. The approach uses the Program Trace Macrocell and the Instrumentation Trace Macrocell, both part of ARM's CoreSight architecture, to detect control-flow and data-flow errors, respectively. The proposed approach has been tested under low-energy proton irradiation. Experimental results demonstrate high accuracy, with up to 95% of observed errors detected in a commercial microprocessor with no hardware modification. In addition, the paper shows how the approach can support further analysis and diagnosis of the cause of errors.
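
    The data-flow side of this idea can be sketched as follows: the application computes a result twice and exports both copies through an instrumentation-trace write, so that an external monitor watching the trace can compare them and flag a mismatch. The stimulus-port address and function names are assumptions for illustration, not the actual Cortex-A9 memory map.

```c
#include <stdint.h>

/* Placeholder stimulus-port address; the real CoreSight ITM mapping is
 * SoC-specific and configured through the debug/trace infrastructure. */
#define ITM_STIM_PORT0 (*(volatile uint32_t *)0xF0001000u)

static inline void itm_emit(uint32_t v)
{
    ITM_STIM_PORT0 = v; /* value is packetized into the trace stream */
}

/* The application exports duplicated results; the external monitor
 * compares them and signals a data-flow error on a mismatch. */
uint32_t checked_mac(uint32_t acc, uint32_t a, uint32_t b)
{
    volatile uint32_t a2 = a, b2 = b;  /* defeat compiler CSE */
    uint32_t r1 = acc + a * b;         /* primary computation   */
    uint32_t r2 = acc + a2 * b2;       /* redundant computation */
    itm_emit(r1);
    itm_emit(r2);
    return r1;
}
```

    Note the volatile temporaries: without them an optimizing compiler would collapse the two computations into one, defeating the redundancy.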

    Error Detection and Diagnosis for System-on-Chip in Space Applications

    Thesis by compendium of publications. Commercial electronic components, commonly known as Commercial-Off-The-Shelf (COTS) components, are present in a wide variety of devices used in our daily life.
Particularly, the use of microprocessors and highly integrated System-on-Chip (SoC) devices has fostered the advent of increasingly intelligent electronic devices that sustain the lifestyle and progress of modern society. Microprocessors are present even in safety-critical systems such as vehicles, planes, weapons, medical devices, implants, and power plants, where a fault could have severe human or economic consequences. However, every electronic system continuously faces internal and external factors that can provoke faults in its operation. The capacity of a system to operate correctly in the presence of faults is known as fault tolerance, and it is a requirement in the design and operation of critical systems. Space vehicles such as satellites and spacecraft also incorporate microprocessors to operate autonomously or semi-autonomously during their service life, with the additional difficulty that they cannot be repaired once in orbit, so they are considered critical systems. In addition, the harsh conditions in space, and radiation effects in particular, pose a major challenge to the correct operation of electronic devices. Radiation-induced soft errors have the potential to become one of the major threats to the reliability of systems in space. Large space missions, typically publicly funded as in the case of NASA or the European Space Agency (ESA), have historically followed the requirement to avoid risk at any expense, regardless of cost or schedule constraints. For this reason, the selection of radiation-hardened (rad-hard) components specifically designed for use in space has been the dominant methodology in the paradigm of the traditional space industry, also known as "Old Space". However, rad-hard components commonly have a much higher cost and much lower performance than equivalent COTS devices. In fact, COTS components have already been used successfully by NASA and ESA in missions whose performance requirements could not be satisfied by any available rad-hard component. In recent years, access to space has become easier, in large part due to the entry of private companies into the space industry. Such companies do not always seek to avoid risk at any cost; they must pursue profitability, so they trade off risk, cost, and schedule through risk management in a paradigm known as "New Space". Private companies are often interested in delivering space-based services with the maximum possible performance and benefit, for which rad-hard components are less attractive than COTS components due to their higher cost and lower performance. However, COTS components have not been specifically designed for use in space and typically do not include techniques to avoid or mitigate radiation effects on their operation. COTS components are commercialized "as is", so it is not possible to modify them to improve their resistance to radiation. Moreover, the high levels of integration of complex, high-performance SoC devices hinder their observability and the application of fault-tolerance techniques. This problem is especially relevant for microprocessors. Thus, there is growing interest in techniques that help understand and improve the behavior of COTS microprocessors under radiation without modifying their architecture and without interfering with their operation.
Such techniques may facilitate the use of COTS components in space and maximize the performance of present and future space missions. In this Thesis, novel techniques have been developed to detect, diagnose, and mitigate radiation-induced errors in COTS microprocessors and SoCs using the trace interface as an observation point. The trace interface is a resource commonly found in modern microprocessors, mainly intended to support software development and debugging during the design phase. However, it is commonly left unused during the operational phase of the system, so it can be reused at no cost. The trace interface constitutes a feasible connection point to observe microprocessor behavior non-intrusively and without disturbing processor operation. As a result of this Thesis, an IP module has been developed that is capable of gathering and decoding the trace information of a modern, high-end COTS microprocessor. The IP is highly configurable and customizable to support different applications and processor types. It has been designed and validated using the Xilinx Zynq-7000 device as a development platform, a COTS device of interest to the space industry. This device features a dual-core ARM Cortex-A9 processor, which is representative of modern, high-end, hard-core microprocessors. The resulting IP is compatible with the ARM CoreSight technology, which enables access to trace information in ARM microprocessors. The IP detects errors in the execution flow of the microprocessor and in the application data using trace information, in real time and with very low latency. It has been validated in fault-injection campaigns and under proton and neutron irradiation at specialized facilities, and it has been combined with other fault-tolerance techniques to build hybrid error-mitigation approaches. Experimental results demonstrate its high detection capability and its potential for the diagnosis of radiation-induced errors. The result of this Thesis, developed in the framework of an Industrial Ph.D. between the University Carlos III of Madrid (UC3M) and the company Arquimea, has been successfully transferred to the company as a project sponsored by the European Space Agency to continue its development and subsequent commercialization. Doctoral Programme in Electrical, Electronic and Automation Engineering, Universidad Carlos III de Madrid. Committee: María Luisa López Vallejo (chair), Enrique San Millán Heredia (secretary), Luigi Di Lill
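
    As a toy illustration of the trace-decoding step performed by the IP, the loop below pulls bytes from a captured trace buffer, rebuilds branch addresses, and hands them to a checker. The one-byte header and little-endian address layout are invented for readability; the real ARM PFT/PTM encoding is far more compact and stateful.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

enum { PKT_BRANCH = 0x01 }; /* invented packet type, not the real PFT */

static void check_branch(uint32_t addr) /* downstream checker stub */
{
    printf("branch to %08x\n", (unsigned)addr);
}

void decode_trace(const uint8_t *buf, size_t len)
{
    size_t i = 0;
    while (i + 5 <= len) {              /* 1 header byte + 4 address bytes */
        if (buf[i] == PKT_BRANCH) {
            uint32_t addr = (uint32_t)buf[i + 1]
                          | (uint32_t)buf[i + 2] << 8
                          | (uint32_t)buf[i + 3] << 16
                          | (uint32_t)buf[i + 4] << 24;
            check_branch(addr);
            i += 5;
        } else {
            i++;                        /* skip bytes this toy model ignores */
        }
    }
}
```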

    Error Detection and Mitigation of Data-Intensive Microprocessor Applications Using SIMD and Trace Monitoring

    This article proposes a software error-mitigation approach that uses the single instruction, multiple data (SIMD) coprocessor to accelerate computation over redundant data. In addition, an external IP connected to the microprocessor's trace interface is used to detect errors that are difficult to cover with software-implemented techniques. The proposed approach has been implemented on an ARM microprocessor, and a neutron irradiation campaign has been carried out at Los Alamos National Laboratory. Experimental results demonstrate the high error coverage (more than 99.9%) of the proposed approach. The neutron cross section of errors that were neither corrected nor detected was reduced by more than three orders of magnitude.
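
    The core of the SIMD idea can be sketched with NEON intrinsics (building requires a NEON-enabled ARM target): both 32-bit lanes of a 64-bit vector hold copies of the operands, a single vector instruction processes both copies, and a lane comparison exposes a disagreement. This is a minimal, hypothetical fragment, not the article's full mitigation scheme.

```c
#include <arm_neon.h>
#include <stdint.h>
#include <stdbool.h>

/* Returns false when the duplicated lanes disagree (soft error). */
bool duplicated_add(uint32_t a, uint32_t b, uint32_t *out)
{
    uint32x2_t va = vdup_n_u32(a);    /* lanes: {a, a} */
    uint32x2_t vb = vdup_n_u32(b);    /* lanes: {b, b} */
    uint32x2_t vr = vadd_u32(va, vb); /* one instruction, two copies */

    uint32_t r0 = vget_lane_u32(vr, 0);
    uint32_t r1 = vget_lane_u32(vr, 1);

    *out = r0;
    return r0 == r1;
}
```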

    Fault Tolerant Electronic System Design

    Due to technology scaling, which brings reduced transistor size, higher density, lower voltage, and more aggressive clock frequencies, VLSI devices are becoming more sensitive to soft errors. Especially for devices used in safety- and mission-critical applications, dependability and reliability are increasingly important constraints during the development of systems built on or around them. Other phenomena (e.g., aging and wear-out effects) also negatively impact the reliability of modern circuits. Recent research shows that even at sea level, radiation particles can induce soft errors in electronic systems. On one hand, processor-based systems are commonly used in a wide variety of applications, including safety-critical and high-availability missions, e.g., in the automotive, biomedical, and aerospace domains. In these fields, an error may produce catastrophic consequences. Thus, dependability is a primary target that must be achieved under tight constraints in terms of cost, performance, power, and time to market. With standards and regulations (e.g., ISO-26262, DO-254, IEC-61508) clearly specifying the targets to be achieved and the methods to prove their achievement, techniques working at the system level are particularly attractive. On the other hand, Field Programmable Gate Array (FPGA) devices are becoming more and more attractive, including in safety- and mission-critical applications, due to their high performance, low power consumption, and flexibility for reconfiguration. Two types of FPGAs are commonly used, distinguished by their configuration memory cell technology: SRAM-based and Flash-based FPGAs. For SRAM-based FPGAs, the SRAM cells of the configuration memory are highly susceptible to radiation-induced effects, which can lead to system failure; for Flash-based FPGAs, even though their non-volatile configuration memory cells are almost immune to Single Event Upsets induced by energetic particles, the floating-gate switches and the logic cells in the configuration tiles can still suffer from Single Event Effects when hit by a highly charged particle. Analysis and mitigation techniques for Single Event Effects on FPGAs are therefore becoming increasingly important in the design flow, especially when reliability is one of the main requirements.
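
    One classic mitigation for the SRAM configuration-memory upsets mentioned above is scrubbing: periodically comparing configuration frames against a golden copy and rewriting any frame that has been upset. The fragment below is a purely conceptual sketch over a simulated configuration array with a stand-in checksum; real devices expose frame access and ECC/CRC through vendor-specific configuration ports.

```c
#include <stdint.h>
#include <stddef.h>

#define FRAME_WORDS 101   /* assumed frame size          */
#define NUM_FRAMES  64    /* assumed slice of the device */

/* Simulated configuration memory and its golden copy; on real parts
 * these accesses go through a device-specific configuration port. */
static uint32_t cfg[NUM_FRAMES][FRAME_WORDS];
static uint32_t golden[NUM_FRAMES][FRAME_WORDS];

/* Stand-in checksum; real scrubbers use the device's frame ECC/CRC. */
static uint32_t fold(const uint32_t *w, size_t n)
{
    uint32_t x = 0;
    while (n--) x ^= *w++;
    return x;
}

void scrub_pass(void)
{
    for (size_t f = 0; f < NUM_FRAMES; f++) {
        if (fold(cfg[f], FRAME_WORDS) != fold(golden[f], FRAME_WORDS)) {
            /* repair the upset frame from the golden bitstream copy */
            for (size_t w = 0; w < FRAME_WORDS; w++)
                cfg[f][w] = golden[f][w];
        }
    }
}
```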

    Real-time trace decoding and monitoring for safety and security in embedded systems

    Integrated circuits and systems can be found almost everywhere in today's world. As their use increases, they need to be made safer and more performant to meet current demands in processing power. FPGA-integrated SoCs can provide an ideal trade-off between performance, adaptability, and energy usage. One of today's vital challenges lies in updating existing fault-tolerance techniques for these new systems while utilizing all available processing capabilities, such as multi-core and heterogeneous processing units. Control-flow monitoring is one of the primary mechanisms described for error detection at the software architectural level for the highest hazard-level classifications (e.g., ASIL D) in industry safety standards such as ISO-26262. Control-flow errors are also known to compose the majority of detected errors for ICs and embedded systems in safety-critical and risk-susceptible environments [5]. Software-based monitoring methods remain the most popular [6–8]; however, recent studies show that the overheads they impose make actual reliability gains negligible [9, 10]. This work proposes and demonstrates a new control-flow checking method implemented in FPGA for multi-core embedded systems, called the control-flow trace checker (CFTC). CFTC uses the existing trace and debug subsystems of modern processors to rebuild their execution states. It can identify errors in real time by comparing executed states to a set of permitted state transitions determined statically. The implementation weighs hardware resource trade-offs to target multiple independent tasks in multi-core embedded applications, as well as single-core systems. The proposed system is entirely implemented in hardware and isolated from all monitored software components, requiring 2.4% of the target FPGA platform's resources to protect an execution unit in its entirety. It therefore avoids undesired overheads and maintains deterministic error-detection latencies, which guarantees reliability improvements without impairing the target software system. Finally, CFTC is evaluated under different software fault-injection scenarios, achieving detection rates of 100% of all control-flow errors to wrong destinations and 98% of all injected faults to program binaries. All detection times are further analyzed and precisely described by a model based on the monitor's resources and speed and on the software application's control-flow structure and binary characteristics.
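
    The CFTC checking rule can be pictured as a table-driven state machine: the monitor tracks the current basic block of a task and accepts only statically permitted successors. The sketch below is an illustrative software model with invented table contents; the actual checker is implemented in FPGA hardware, with one such table per monitored task.

```c
#include <stdint.h>
#include <stdbool.h>

#define MAX_SUCC 4

typedef struct {
    uint16_t succ[MAX_SUCC]; /* permitted successor basic blocks */
    uint8_t  n;              /* number of valid entries          */
} block_t;

/* Illustrative transition table for one monitored task, derived
 * statically from its binary. */
static const block_t cfg[] = {
    /* block 0 */ { { 1, 2 }, 2 },
    /* block 1 */ { { 3, 0 }, 2 },
    /* block 2 */ { { 3, 0 }, 2 },
    /* block 3 */ { { 0, 0 }, 1 },
};

static uint16_t current_block = 0; /* monitor state for this task */

/* Feed one transition recovered from the trace; false = error. */
bool cftc_step(uint16_t next_block)
{
    const block_t *b = &cfg[current_block];
    for (uint8_t i = 0; i < b->n; i++) {
        if (b->succ[i] == next_block) {
            current_block = next_block;
            return true;
        }
    }
    return false; /* illegal state transition: control-flow error */
}
```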

    SyRA: early system reliability analysis for cross-layer soft errors resilience in memory arrays of microprocessor systems

    Cross-layer reliability is becoming the preferred solution when reliability is a concern in the design of microprocessor-based systems. Nevertheless, deciding how to distribute error management across the different layers of the system is a very complex task that requires the support of dedicated frameworks for cross-layer reliability analysis. This paper proposes SyRA, a system-level cross-layer early reliability analysis framework for radiation-induced soft errors in memory arrays of microprocessor-based systems. The framework exploits a multi-level hybrid Bayesian model to describe the target system and takes advantage of Bayesian inference to estimate different reliability metrics. SyRA implements several mechanisms and features to deal with the complexity of realistic models, and its complete tool-chain scales efficiently with the complexity of the system. The simulation time is significantly lower than that of micro-architecture-level or RTL fault-injection experiments, with accuracy high enough to support effective design decisions. To demonstrate the capability of SyRA, we analyzed the reliability of a set of microprocessor-based systems characterized by different microprocessor architectures (i.e., Intel x86, ARM Cortex-A15, ARM Cortex-A9) running either the Linux operating system or bare metal. Each system under analysis executes different software workloads, both from benchmark suites and from real applications.
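
    The cross-layer estimation can be pictured, in drastically simplified form, as chaining per-layer propagation probabilities from a raw upset to a system failure; SyRA's Bayesian model generalizes this to many interdependent components, masking effects, and workload dependence. The numbers below are invented purely for illustration.

```c
#include <stdio.h>

int main(void)
{
    /* Invented per-layer probabilities for one memory bit and workload. */
    double p_seu  = 1e-7; /* raw upset probability per execution       */
    double p_arch = 0.3;  /* P(upset reaches architectural state)      */
    double p_app  = 0.1;  /* P(corrupted state affects program output) */

    /* Chain rule over the layers; a Bayesian model lets each factor be
     * estimated per component and conditioned on the workload.        */
    double p_fail = p_seu * p_arch * p_app;
    printf("estimated failure probability: %.3e\n", p_fail);
    return 0;
}
```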

    Multilevel Runtime Verification for Safety and Security Critical Cyber Physical Systems from a Model Based Engineering Perspective

    Advanced embedded system technology is one of the key driving forces behind the rapid growth of Cyber-Physical System (CPS) applications. A CPS consists of multiple coordinating and cooperating components, which are often software-intensive and interact with each other to achieve unprecedented tasks. Such highly integrated CPSs have complex interaction failures, attack surfaces, and attack vectors that must be protected and secured against. This dissertation advances the state of the art by developing a multilevel runtime monitoring approach for safety- and security-critical CPSs, with monitors at each level of processing and integration. Given that computation and data-processing vulnerabilities may exist at multiple levels in an embedded CPS, solutions present at the levels where the faults or vulnerabilities originate are beneficial for the timely detection of anomalies. Further, the increasing functional and architectural complexity of critical CPSs has significant safety and security implications for operation. These challenges lead to a need for new methods that provide a continuum between design-time assurance and runtime or operational assurance. Towards this end, this dissertation explores Model-Based Engineering methods by which design assurance can be carried forward to the runtime domain, creating a shared responsibility for reducing the overall risk associated with the system in operation. A synergistic combination of Verification & Validation at design time and runtime monitoring at multiple levels is therefore beneficial in assuring the safety and security of critical CPSs. Finally, we realize our multilevel runtime monitor framework in hardware using a stream-based runtime verification language.
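
    To give a flavor of stream-based runtime verification, the hand-coded automaton below checks a bounded-response property ("every request is followed by a grant within N events") over an event stream; a stream-based language would express the same property declaratively. The event names and bound are assumptions for illustration.

```c
#include <stdbool.h>

enum ev { EV_REQ, EV_GRANT, EV_OTHER };
#define BOUND 8

static int pending = -1; /* events since unanswered REQ, -1 = none */

/* Consume one event; returns false when the property is violated. */
bool monitor_step(enum ev e)
{
    if (e == EV_REQ)   pending = 0;   /* start counting               */
    if (e == EV_GRANT) pending = -1;  /* obligation discharged        */
    if (pending >= 0 && ++pending > BOUND)
        return false;                 /* no GRANT within BOUND events */
    return true;
}
```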

    Observation mechanisms for in-field software-based self-test

    When electronic systems are used in safety-critical applications, as in the space, avionic, automotive, or biomedical areas, a very low probability of failure due to faults of any kind must be maintained. Standards and regulations play a significant role, forcing companies to devise and adopt solutions able to achieve predefined dependability targets. Different techniques can be used to reduce fault occurrence or to minimize the probability that faults produce critical failures (e.g., by introducing redundancy). Unfortunately, most of these techniques have a severe impact on the cost of the resulting product and, in some cases, the probability of failure remains too large anyway. Hence, a solution commonly used in several scenarios lies in periodically performing a test able to detect the occurrence of any fault before it produces a failure (in-field test). This solution is normally based on forcing the processor inside the Device Under Test to execute a properly written test program that activates possible faults and makes their effects visible at observable locations. This approach is also called Software-Based Self-Test, or SBST. Compared with testing in an end-of-manufacturing scenario, in-field testing has strong limitations in terms of access to the system inputs and outputs, because Design-for-Testability structures and testing equipment are usually not available. As a consequence, there are fewer possibilities to activate faults and to observe their effects. This reduced observability particularly affects the ability to detect performance faults, i.e., faults that modify the timing but not the final value of computations. Such faults are hard to detect by observing only the final content of predefined memory locations, which is the usual method for observing test results in the field. Initially, the present work was focused on fault-tolerance techniques against transient faults induced by ionizing radiation, the so-called Single Event Upsets (SEUs). The main contribution of this early stage of the thesis lies in the experimental validation of the feasibility of achieving a safe system by using an architecture that combines task-level redundancy with already available IP cores, thus minimizing development time. Task execution is replicated, and Memory Protection is used to guarantee that any SEU may affect one and only one of the replicas. A proof-of-concept implementation was developed and validated using fault injection. Results outline the effectiveness of the architecture, and the overhead analysis shows that the proposed architecture is effective in reducing resource occupation with respect to N-modular redundancy, at an affordable cost in terms of application execution time. The main part of the thesis focuses on in-field software-based self-test for permanent faults. A set of observation methods exploiting existing or ad-hoc hardware is proposed, aimed at obtaining better coverage, in particular of performance faults. An extensive quantitative evaluation of the proposed methods is presented, including a comparison with the observation methods traditionally used in end-of-manufacturing and in-field testing. Results show that the proposed methods are a good complement to the traditional observation of the final memory content. Moreover, they show that an adequate combination of these complementary methods achieves nearly the same fault coverage as continuously observing all the processor outputs, an observation method commonly used in production test but usually not available in the field. A very interesting by-product of the above is a detailed description of how to compute the fault coverage achieved by functional in-field tests using a conventional fault simulator, a tool usually applied in an end-of-manufacturing testing scenario. Finally, another relevant result in the testing area is a method to detect permanent faults inside the cache coherence logic integrated in each cache controller of a multi-core system, based on the concurrent execution of a test program by the different cores in a coordinated manner. By construction, the method achieves full fault coverage of the static faults in the addressed logic.
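
    The observation idea for performance faults can be sketched as follows: the self-test routine folds an elapsed-time reading into its final signature, so a fault that only slows execution still changes an observable value. The cycle-counter hook, kernel body, and calibration constants below are assumptions for illustration, not the thesis's actual observation mechanisms.

```c
#include <stdint.h>

/* Hypothetical platform hook; a real port would read a hardware
 * cycle counter or timer. */
extern uint32_t read_cycle_counter(void);

/* Stand-in for the actual SBST kernel: exercises some logic and
 * folds the results into a signature. */
static uint32_t sbst_kernel(void)
{
    uint32_t sig = 0x12345678u;
    for (uint32_t i = 1; i <= 1000u; i++)
        sig = (sig << 1 | sig >> 31) ^ (i * i);
    return sig;
}

#define EXPECTED_CYCLES 12000u /* assumed, calibrated on a fault-free run */
#define TOLERANCE       64u

uint32_t sbst_signature(void)
{
    uint32_t t0  = read_cycle_counter();
    uint32_t sig = sbst_kernel();
    uint32_t dt  = read_cycle_counter() - t0;

    /* Fold a quantized timing deviation into the signature, so a fault
     * that only changes execution time still alters the stored result. */
    uint32_t slow = (dt > EXPECTED_CYCLES + TOLERANCE);
    uint32_t fast = (dt + TOLERANCE < EXPECTED_CYCLES);
    return sig ^ ((slow | fast) << 31);
}
```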