669 research outputs found

    Error Detection and Diagnosis for System-on-Chip in Space Applications

    Get PDF
    Tesis por compendio de publicacionesLos componentes electrónicos comerciales, comúnmente llamados componentes Commercial-Off-The-Shelf (COTS) están presentes en multitud de dispositivos habituales en nuestro día a día. Particularmente, el uso de microprocesadores y sistemas en chip (SoC) altamente integrados ha favorecido la aparición de dispositivos electrónicos cada vez más inteligentes que sostienen el estilo de vida y el avance de la sociedad moderna. Su uso se ha generalizado incluso en aquellos sistemas que se consideran críticos para la seguridad, como vehículos, aviones, armamento, dispositivos médicos, implantes o centrales eléctricas. En cualquiera de ellos, un fallo podría tener graves consecuencias humanas o económicas. Sin embargo, todos los sistemas electrónicos conviven constantemente con factores internos y externos que pueden provocar fallos en su funcionamiento. La capacidad de un sistema para funcionar correctamente en presencia de fallos se denomina tolerancia a fallos, y es un requisito en el diseño y operación de sistemas críticos. Los vehículos espaciales como satélites o naves espaciales también hacen uso de microprocesadores para operar de forma autónoma o semi autónoma durante su vida útil, con la dificultad añadida de que no pueden ser reparados en órbita, por lo que se consideran sistemas críticos. Además, las duras condiciones existentes en el espacio, y en particular los efectos de la radiación, suponen un gran desafío para el correcto funcionamiento de los dispositivos electrónicos. Concretamente, los fallos transitorios provocados por radiación (conocidos como soft errors) tienen el potencial de ser una de las mayores amenazas para la fiabilidad de un sistema en el espacio. Las misiones espaciales de gran envergadura, típicamente financiadas públicamente como en el caso de la NASA o la Agencia Espacial Europea (ESA), han tenido históricamente como requisito evitar el riesgo a toda costa por encima de cualquier restricción de coste o plazo. Por ello, la selección de componentes resistentes a la radiación (rad-hard) específicamente diseñados para su uso en el espacio ha sido la metodología imperante en el paradigma que hoy podemos denominar industria espacial tradicional, u Old Space. Sin embargo, los componentes rad-hard tienen habitualmente un coste mucho más alto y unas prestaciones mucho menores que otros componentes COTS equivalentes. De hecho, los componentes COTS ya han sido utilizados satisfactoriamente en misiones de la NASA o la ESA cuando las prestaciones requeridas por la misión no podían ser cubiertas por ningún componente rad-hard existente. En los últimos años, el acceso al espacio se está facilitando debido en gran parte a la entrada de empresas privadas en la industria espacial. Estas empresas no siempre buscan evitar el riesgo a toda costa, sino que deben perseguir una rentabilidad económica, por lo que hacen un balance entre riesgo, coste y plazo mediante gestión del riesgo en un paradigma denominado Nuevo Espacio o New Space. Estas empresas a menudo están interesadas en entregar servicios basados en el espacio con las máximas prestaciones y el mayor beneficio posibles, para lo cual los componentes rad-hard son menos atractivos debido a su mayor coste y menores prestaciones que los componentes COTS existentes. Sin embargo, los componentes COTS no han sido específicamente diseñados para su uso en el espacio y típicamente no incluyen técnicas específicas para evitar que los efectos de la radiación afecten su funcionamiento. Los componentes COTS se comercializan tal cual son, y habitualmente no es posible modificarlos para mejorar su resistencia a la radiación. Además, los elevados niveles de integración de los sistemas en chip (SoC) complejos de altas prestaciones dificultan su observación y la aplicación de técnicas de tolerancia a fallos. Este problema es especialmente relevante en el caso de los microprocesadores. Por tanto, existe un gran interés en el desarrollo de técnicas que permitan conocer y mejorar el comportamiento de los microprocesadores COTS bajo radiación sin modificar su arquitectura y sin interferir en su funcionamiento para facilitar su uso en el espacio y con ello maximizar las prestaciones de las misiones espaciales presentes y futuras. En esta Tesis se han desarrollado técnicas novedosas para detectar, diagnosticar y mitigar los errores producidos por radiación en microprocesadores y sistemas en chip (SoC) comerciales, utilizando la interfaz de traza como punto de observación. La interfaz de traza es un recurso habitual en los microprocesadores modernos, principalmente enfocado a soportar las tareas de desarrollo y depuración del software durante la fase de diseño. Sin embargo, una vez el desarrollo ha concluido, la interfaz de traza típicamente no se utiliza durante la fase operativa del sistema, por lo que puede ser reutilizada sin coste. La interfaz de traza constituye un punto de conexión viable para observar el comportamiento de un microprocesador de forma no intrusiva y sin interferir en su funcionamiento. Como resultado de esta Tesis se ha desarrollado un módulo IP capaz de recabar y decodificar la información de traza de un microprocesador COTS moderno de altas prestaciones. El IP es altamente configurable y personalizable para adaptarse a diferentes aplicaciones y tipos de procesadores. Ha sido diseñado y validado utilizando el dispositivo Zynq-7000 de Xilinx como plataforma de desarrollo, que constituye un dispositivo COTS de interés en la industria espacial. Este dispositivo incluye un procesador ARM Cortex-A9 de doble núcleo, que es representativo del conjunto de microprocesadores hard-core modernos de altas prestaciones. El IP resultante es compatible con la tecnología ARM CoreSight, que proporciona acceso a información de traza en los microprocesadores ARM. El IP incorpora técnicas para detectar errores en el flujo de ejecución y en los datos de la aplicación ejecutada utilizando la información de traza, en tiempo real y con muy baja latencia. El IP se ha validado en campañas de inyección de fallos y también en radiación con protones y neutrones en instalaciones especializadas. También se ha combinado con otras técnicas de tolerancia a fallos para construir técnicas híbridas de mitigación de errores. Los resultados experimentales obtenidos demuestran su alta capacidad de detección y potencialidad en el diagnóstico de errores producidos por radiación. El resultado de esta Tesis, desarrollada en el marco de un Doctorado Industrial entre la Universidad Carlos III de Madrid (UC3M) y la empresa Arquimea, se ha transferido satisfactoriamente al entorno empresarial en forma de un proyecto financiado por la Agencia Espacial Europea para continuar su desarrollo y posterior explotación.Commercial electronic components, also known as Commercial-Off-The-Shelf (COTS), are present in a wide variety of devices commonly used in our daily life. Particularly, the use of microprocessors and highly integrated System-on-Chip (SoC) devices has fostered the advent of increasingly intelligent electronic devices which sustain the lifestyles and the progress of modern society. Microprocessors are present even in safety-critical systems, such as vehicles, planes, weapons, medical devices, implants, or power plants. In any of these cases, a fault could involve severe human or economic consequences. However, every electronic system deals continuously with internal and external factors that could provoke faults in its operation. The capacity of a system to operate correctly in presence of faults is known as fault-tolerance, and it becomes a requirement in the design and operation of critical systems. Space vehicles such as satellites or spacecraft also incorporate microprocessors to operate autonomously or semi-autonomously during their service life, with the additional difficulty that they cannot be repaired once in-orbit, so they are considered critical systems. In addition, the harsh conditions in space, and specifically radiation effects, involve a big challenge for the correct operation of electronic devices. In particular, radiation-induced soft errors have the potential to become one of the major risks for the reliability of systems in space. Large space missions, typically publicly funded as in the case of NASA or European Space Agency (ESA), have followed historically the requirement to avoid the risk at any expense, regardless of any cost or schedule restriction. Because of that, the selection of radiation-resistant components (known as rad-hard) specifically designed to be used in space has been the dominant methodology in the paradigm of traditional space industry, also known as “Old Space”. However, rad-hard components have commonly a much higher associated cost and much lower performance that other equivalent COTS devices. In fact, COTS components have already been used successfully by NASA and ESA in missions that requested such high performance that could not be satisfied by any available rad-hard component. In the recent years, the access to space is being facilitated in part due to the irruption of private companies in the space industry. Such companies do not always seek to avoid the risk at any cost, but they must pursue profitability, so they perform a trade-off between risk, cost, and schedule through risk management in a paradigm known as “New Space”. Private companies are often interested in deliver space-based services with the maximum performance and maximum benefit as possible. With such objective, rad-hard components are less attractive than COTS due to their higher cost and lower performance. However, COTS components have not been specifically designed to be used in space and typically they do not include specific techniques to avoid or mitigate the radiation effects in their operation. COTS components are commercialized “as is”, so it is not possible to modify them to improve their susceptibility to radiation effects. Moreover, the high levels of integration of complex, high-performance SoC devices hinder their observability and the application of fault-tolerance techniques. This problem is especially relevant in the case of microprocessors. Thus, there is a growing interest in the development of techniques allowing to understand and improve the behavior of COTS microprocessors under radiation without modifying their architecture and without interfering with their operation. Such techniques may facilitate the use of COTS components in space and maximize the performance of present and future space missions. In this Thesis, novel techniques have been developed to detect, diagnose, and mitigate radiation-induced errors in COTS microprocessors and SoCs using the trace interface as an observation point. The trace interface is a resource commonly found in modern microprocessors, mainly intended to support software development and debugging activities during the design phase. However, it is commonly left unused during the operational phase of the system, so it can be reused with no cost. The trace interface constitutes a feasible connection point to observe microprocessor behavior in a non-intrusive manner and without disturbing processor operation. As a result of this Thesis, an IP module has been developed capable to gather and decode the trace information of a modern, high-end, COTS microprocessor. The IP is highly configurable and customizable to support different applications and processor types. The IP has been designed and validated using the Xilinx Zynq-7000 device as a development platform, which is an interesting COTS device for the space industry. This device features a dual-core ARM Cortex-A9 processor, which is a good representative of modern, high-end, hard-core microprocessors. The resulting IP is compatible with the ARM CoreSight technology, which enables access to trace information in ARM microprocessors. The IP is able to detect errors in the execution flow of the microprocessor and in the application data using trace information, in real time and with very low latency. The IP has been validated in fault injection campaigns and also under proton and neutron irradiation campaigns in specialized facilities. It has also been combined with other fault-tolerance techniques to build hybrid error mitigation approaches. Experimental results demonstrate its high detection capabilities and high potential for the diagnosis of radiation-induced errors. The result of this Thesis, developed in the framework of an Industrial Ph.D. between the University Carlos III of Madrid (UC3M) and the company Arquimea, has been successfully transferred to the company business as a project sponsored by European Space Agency to continue its development and subsequent commercialization.Programa de Doctorado en Ingeniería Eléctrica, Electrónica y Automática por la Universidad Carlos III de MadridPresidenta: María Luisa López Vallejo.- Secretario: Enrique San Millán Heredia.- Vocal: Luigi Di Lill

    Online error detection through trace infrastructure in ARM microprocessors

    Get PDF
    This paper presents a solution for error detection in ARM microprocessors based on the use of the trace infrastructure. This approach uses the Program and Instrumentation Trace Macrocells that are part of ARM's CoreSight architecture to detect control-flow and data-flow errors, respectively. The proposed approach has been tested with low-energy protons. Experimental results demonstrate high accuracy with up to 95% of observed errors detected in a commercial microprocessor with no hardware modification. In addition, it is shown how the proposed approach can be useful for further analysis and diagnosis of the cause of errors

    PTM-based hybrid error-detection architecture for ARM microprocessors

    Get PDF
    This work presents a hybrid error detection architecture that uses ARM PTM trace interface to observe ARM microprocessor behaviour. The proposed approach is suitable for COTS microprocessors because it does not modify the microprocessor architecture and is able to detect errors thanks to the reuse of its trace subsystem. Validation has been performed by proton irradiation and fault injection campaigns on a Zynq AP SoC including a Cortex-A9 ARM microprocessor and an implementation of the proposed hardware monitor in programmable logic. Experimental results demonstrate that a high error detection rate can be achieved on a commercial microprocessor

    Selective SWIFT-R. A Flexible Software-Based Technique for Soft Error Mitigation in Low-Cost Embedded Systems

    Get PDF
    Commercial off-the-shelf microprocessors are the core of low-cost embedded systems due to their programmability and cost-effectiveness. Recent advances in electronic technologies have allowed remarkable improvements in their performance. However, they have also made microprocessors more susceptible to transient faults induced by radiation. These non-destructive events (soft errors), may cause a microprocessor to produce a wrong computation result or lose control of a system with catastrophic consequences. Therefore, soft error mitigation has become a compulsory requirement for an increasing number of applications, which operate from the space to the ground level. In this context, this paper uses the concept of selective hardening, which is aimed to design reduced-overhead and flexible mitigation techniques. Following this concept, a novel flexible version of the software-based fault recovery technique known as SWIFT-R is proposed. Our approach makes possible to select different registers subsets from the microprocessor register file to be protected on software. Thus, design space is enriched with a wide spectrum of new partially protected versions, which offer more flexibility to designers. This permits to find the best trade-offs between performance, code size, and fault coverage. Three case studies have been developed to show the applicability and flexibility of the proposal.This work was funded by the Ministry of Science and Innovation in Spain with the project ‘RENASER+: Integral Analysis of Digital Circuits and Systems for Aerospace Applications’ (TEC2010-22095-C03-01)

    Multi-core devices for safety-critical systems: a survey

    Get PDF
    Multi-core devices are envisioned to support the development of next-generation safety-critical systems, enabling the on-chip integration of functions of different criticality. This integration provides multiple system-level potential benefits such as cost, size, power, and weight reduction. However, safety certification becomes a challenge and several fundamental safety technical requirements must be addressed, such as temporal and spatial independence, reliability, and diagnostic coverage. This survey provides a categorization and overview at different device abstraction levels (nanoscale, component, and device) of selected key research contributions that support the compliance with these fundamental safety requirements.This work has been partially supported by the Spanish Ministry of Economy and Competitiveness under grant TIN2015-65316-P, Basque Government under grant KK-2019-00035 and the HiPEAC Network of Excellence. The Spanish Ministry of Economy and Competitiveness has also partially supported Jaume Abella under Ramon y Cajal postdoctoral fellowship (RYC-2013-14717).Peer ReviewedPostprint (author's final draft

    Radiation Testing of a Multiprocessor Macrosynchronized Lockstep Architecture With FreeRTOS

    Get PDF
    Nowadays, high-performance microprocessors are demanded in many fields, including those with high-reliability requirements. Commercial microprocessors present a good tradeoff between cost, size, and performance, albeit they must be adapted to satisfy the reliability requirements when they are used in harsh environments. This work presents a high-end multiprocessor hardened with macrosynchronized lockstep and additional protections. A commercial dual-core Advanced RISC Machine (ARM) cortex A9 has been used as a case study and a complete hardened system has been developed. Evaluation of the proposed hardened system has been accomplished with exhaustive fault injection campaigns and proton irradiation. The hardening approach has been accomplished for both baremetal applications and operating system (OS)-based. The hardened system has demonstrated high reliability in all performed experiments with error coverage up to 99.3% in the irradiation experiments. Experimental irradiation results demonstrate a cross-sectional reduction of two orders of magnitude.This work was supported in part by the Spanish Ministry of Science and Innovation under Project PID2019-106455GB-C21 and in part by the Community of Madrid under Project 49.520608.9.18Publicad

    Implementation and Characterization of AHR on a Xilinx FPGA

    Get PDF
    A new version of the Adaptive-Hybrid Redundancy (AHR) architecture was developed to be implemented and tested in hardware using Commercial-Off-The-Shelf (COTS) Field-Programmable Gate Arrays (FPGAs). The AHR architecture was developed to mitigate the effects that the Single Event Upset (SEU) and Single Event Transient (SET) radiation effects have on processors and was tested on a Microprocessor without Interlocked Pipeline Stages (MIPS) architecture. The AHR MIPS architecture was implemented in hardware using two Xilinx FPGAs. A Universal Asynchronous Receiver Transmitter (UART) based serial communication network was added to the AHR MIPS design to enable inter-board communication between the two FPGAs. The runtime performance of AHR MIPS was measured in hardware and compared against the runtime performance of standalone TMR and TSR MIPS architectures. The hardware implementation of AHR MIPS demonstrated flexible runtime performance that was nearly as fast as TMR MIPS, never as slow as TSR MIPS, and demonstrated performance in between those extremes. Hardware testing and verification of AHR MIPS showed that the AHR mitigation strategy presents a large performance tradespace, where a user can adjust both the runtime processor performance and radiation tolerance to fit the constraints of a space mission, while also continuing to provide adaptive performance based upon the current radiation environment
    corecore