353 research outputs found

    DeSyRe: on-Demand System Reliability

    No full text
    The DeSyRe project builds on-demand adaptive and reliable Systems-on-Chips (SoCs). As fabrication technology scales down, chips are becoming less reliable, thereby incurring increased power and performance costs for fault tolerance. To make matters worse, power density is becoming a significant limiting factor in SoC design, in general. In the face of such changes in the technological landscape, current solutions for fault tolerance are expected to introduce excessive overheads in future systems. Moreover, attempting to design and manufacture a totally defect and fault-free system, would impact heavily, even prohibitively, the design, manufacturing, and testing costs, as well as the system performance and power consumption. In this context, DeSyRe delivers a new generation of systems that are reliable by design at well-balanced power, performance, and design costs. In our attempt to reduce the overheads of fault-tolerance, only a small fraction of the chip is built to be fault-free. This fault-free part is then employed to manage the remaining fault-prone resources of the SoC. The DeSyRe framework is applied to two medical systems with high safety requirements (measured using the IEC 61508 functional safety standard) and tight power and performance constraints

    On the use of embedded debug features for permanent and transient fault resilience in microprocessors

    Get PDF
    Microprocessor-based systems are employed in an increasing number of applications where dependability is a major constraint. For this reason detecting faults arising during normal operation while introducing the least possible penalties is a main concern. Different forms of redundancy have been employed to ensure error-free behavior, while error detection mechanisms can be employed where some detection latency is tolerated. However, the high complexity and the low observability of microprocessors internal resources make the identification of adequate on-line error detection strategies a very challenging task, which can be tackled at circuit or system level. Concerning system-level strategies, a common limitation is in the mechanism used to monitor program execution and then detect errors as soon as possible, so as to reduce their impact on the application. In this work, an on-line error detection approach based on the reuse of available debugging infrastructures is proposed. The approach can be applied to different system architectures profiting from the debug trace port available in most of current microprocessors to observe possible misbehaviors. Two microprocessors have been used to study the applicability of the solution. LEON3 and ARM7TDMI. Results show that the presented fault detection technique enhances observability and thus error detection abilities in microprocessor-based systems without requiring modifications on the core architecture

    Survey of Soft Error Mitigation Techniques Applied to LEON3 Soft Processors on SRAM-Based FPGAs

    Get PDF
    Soft-core processors implemented in SRAM-based FPGAs are an attractive option for applications to be employed in radiation environments due to their flexibility, relatively-low application development costs, and reconfigurability features enabling them to adapt to the evolving mission needs. Despite the advantages soft-core processors possess, they are seldom used in critical applications because they are more sensitive to radiation than their hard-core counterparts. For instance, both the logic and signal routing circuitry of a soft-core processor as well as its user memory are susceptible to radiation-induced faults. Therefore, soft-core processors must be appropriately hardened against ionizing-radiation to become a feasible design choice for harsh environments and thus to reap all their benefits. This survey henceforth discusses various techniques to protect the configuration and user memories of an LEON3 soft processor, which is one of the most widely used soft-core processors in radiation environments, as reported in the state-of-the-art literature, with the objective of facilitating the choice of right fault-mitigation solution for any given soft-core processor

    A Hybrid Fault-Tolerant LEON3 Soft Core Processor Implemented in Low-End SRAM FPGA

    Get PDF
    In this work we implemented a hybrid fault-tolerant LEON3 soft-core processor in a low-end FPGA (Artix-7) and evaluated its error detection capabilities through neutron irradiation and fault injection in an incremental manner. The error mitigation approach combines the use of SEC/DED codes for memories, a hardware monitor to detect control-flow errors, software-based techniques to detect data errors and configuration memory scrubbing with repair to avoid error accumulation. The proposed solution can significantly improve fault tolerance and can be fully embedded in a low-end FPGA, with reduced overhead and low intrusiveness

    Toward Fault-Tolerant Applications on Reconfigurable Systems-on-Chip

    Get PDF
    L'abstract è presente nell'allegato / the abstract is in the attachmen

    Error Detection and Diagnosis for System-on-Chip in Space Applications

    Get PDF
    Tesis por compendio de publicacionesLos componentes electrónicos comerciales, comúnmente llamados componentes Commercial-Off-The-Shelf (COTS) están presentes en multitud de dispositivos habituales en nuestro día a día. Particularmente, el uso de microprocesadores y sistemas en chip (SoC) altamente integrados ha favorecido la aparición de dispositivos electrónicos cada vez más inteligentes que sostienen el estilo de vida y el avance de la sociedad moderna. Su uso se ha generalizado incluso en aquellos sistemas que se consideran críticos para la seguridad, como vehículos, aviones, armamento, dispositivos médicos, implantes o centrales eléctricas. En cualquiera de ellos, un fallo podría tener graves consecuencias humanas o económicas. Sin embargo, todos los sistemas electrónicos conviven constantemente con factores internos y externos que pueden provocar fallos en su funcionamiento. La capacidad de un sistema para funcionar correctamente en presencia de fallos se denomina tolerancia a fallos, y es un requisito en el diseño y operación de sistemas críticos. Los vehículos espaciales como satélites o naves espaciales también hacen uso de microprocesadores para operar de forma autónoma o semi autónoma durante su vida útil, con la dificultad añadida de que no pueden ser reparados en órbita, por lo que se consideran sistemas críticos. Además, las duras condiciones existentes en el espacio, y en particular los efectos de la radiación, suponen un gran desafío para el correcto funcionamiento de los dispositivos electrónicos. Concretamente, los fallos transitorios provocados por radiación (conocidos como soft errors) tienen el potencial de ser una de las mayores amenazas para la fiabilidad de un sistema en el espacio. Las misiones espaciales de gran envergadura, típicamente financiadas públicamente como en el caso de la NASA o la Agencia Espacial Europea (ESA), han tenido históricamente como requisito evitar el riesgo a toda costa por encima de cualquier restricción de coste o plazo. Por ello, la selección de componentes resistentes a la radiación (rad-hard) específicamente diseñados para su uso en el espacio ha sido la metodología imperante en el paradigma que hoy podemos denominar industria espacial tradicional, u Old Space. Sin embargo, los componentes rad-hard tienen habitualmente un coste mucho más alto y unas prestaciones mucho menores que otros componentes COTS equivalentes. De hecho, los componentes COTS ya han sido utilizados satisfactoriamente en misiones de la NASA o la ESA cuando las prestaciones requeridas por la misión no podían ser cubiertas por ningún componente rad-hard existente. En los últimos años, el acceso al espacio se está facilitando debido en gran parte a la entrada de empresas privadas en la industria espacial. Estas empresas no siempre buscan evitar el riesgo a toda costa, sino que deben perseguir una rentabilidad económica, por lo que hacen un balance entre riesgo, coste y plazo mediante gestión del riesgo en un paradigma denominado Nuevo Espacio o New Space. Estas empresas a menudo están interesadas en entregar servicios basados en el espacio con las máximas prestaciones y el mayor beneficio posibles, para lo cual los componentes rad-hard son menos atractivos debido a su mayor coste y menores prestaciones que los componentes COTS existentes. Sin embargo, los componentes COTS no han sido específicamente diseñados para su uso en el espacio y típicamente no incluyen técnicas específicas para evitar que los efectos de la radiación afecten su funcionamiento. Los componentes COTS se comercializan tal cual son, y habitualmente no es posible modificarlos para mejorar su resistencia a la radiación. Además, los elevados niveles de integración de los sistemas en chip (SoC) complejos de altas prestaciones dificultan su observación y la aplicación de técnicas de tolerancia a fallos. Este problema es especialmente relevante en el caso de los microprocesadores. Por tanto, existe un gran interés en el desarrollo de técnicas que permitan conocer y mejorar el comportamiento de los microprocesadores COTS bajo radiación sin modificar su arquitectura y sin interferir en su funcionamiento para facilitar su uso en el espacio y con ello maximizar las prestaciones de las misiones espaciales presentes y futuras. En esta Tesis se han desarrollado técnicas novedosas para detectar, diagnosticar y mitigar los errores producidos por radiación en microprocesadores y sistemas en chip (SoC) comerciales, utilizando la interfaz de traza como punto de observación. La interfaz de traza es un recurso habitual en los microprocesadores modernos, principalmente enfocado a soportar las tareas de desarrollo y depuración del software durante la fase de diseño. Sin embargo, una vez el desarrollo ha concluido, la interfaz de traza típicamente no se utiliza durante la fase operativa del sistema, por lo que puede ser reutilizada sin coste. La interfaz de traza constituye un punto de conexión viable para observar el comportamiento de un microprocesador de forma no intrusiva y sin interferir en su funcionamiento. Como resultado de esta Tesis se ha desarrollado un módulo IP capaz de recabar y decodificar la información de traza de un microprocesador COTS moderno de altas prestaciones. El IP es altamente configurable y personalizable para adaptarse a diferentes aplicaciones y tipos de procesadores. Ha sido diseñado y validado utilizando el dispositivo Zynq-7000 de Xilinx como plataforma de desarrollo, que constituye un dispositivo COTS de interés en la industria espacial. Este dispositivo incluye un procesador ARM Cortex-A9 de doble núcleo, que es representativo del conjunto de microprocesadores hard-core modernos de altas prestaciones. El IP resultante es compatible con la tecnología ARM CoreSight, que proporciona acceso a información de traza en los microprocesadores ARM. El IP incorpora técnicas para detectar errores en el flujo de ejecución y en los datos de la aplicación ejecutada utilizando la información de traza, en tiempo real y con muy baja latencia. El IP se ha validado en campañas de inyección de fallos y también en radiación con protones y neutrones en instalaciones especializadas. También se ha combinado con otras técnicas de tolerancia a fallos para construir técnicas híbridas de mitigación de errores. Los resultados experimentales obtenidos demuestran su alta capacidad de detección y potencialidad en el diagnóstico de errores producidos por radiación. El resultado de esta Tesis, desarrollada en el marco de un Doctorado Industrial entre la Universidad Carlos III de Madrid (UC3M) y la empresa Arquimea, se ha transferido satisfactoriamente al entorno empresarial en forma de un proyecto financiado por la Agencia Espacial Europea para continuar su desarrollo y posterior explotación.Commercial electronic components, also known as Commercial-Off-The-Shelf (COTS), are present in a wide variety of devices commonly used in our daily life. Particularly, the use of microprocessors and highly integrated System-on-Chip (SoC) devices has fostered the advent of increasingly intelligent electronic devices which sustain the lifestyles and the progress of modern society. Microprocessors are present even in safety-critical systems, such as vehicles, planes, weapons, medical devices, implants, or power plants. In any of these cases, a fault could involve severe human or economic consequences. However, every electronic system deals continuously with internal and external factors that could provoke faults in its operation. The capacity of a system to operate correctly in presence of faults is known as fault-tolerance, and it becomes a requirement in the design and operation of critical systems. Space vehicles such as satellites or spacecraft also incorporate microprocessors to operate autonomously or semi-autonomously during their service life, with the additional difficulty that they cannot be repaired once in-orbit, so they are considered critical systems. In addition, the harsh conditions in space, and specifically radiation effects, involve a big challenge for the correct operation of electronic devices. In particular, radiation-induced soft errors have the potential to become one of the major risks for the reliability of systems in space. Large space missions, typically publicly funded as in the case of NASA or European Space Agency (ESA), have followed historically the requirement to avoid the risk at any expense, regardless of any cost or schedule restriction. Because of that, the selection of radiation-resistant components (known as rad-hard) specifically designed to be used in space has been the dominant methodology in the paradigm of traditional space industry, also known as “Old Space”. However, rad-hard components have commonly a much higher associated cost and much lower performance that other equivalent COTS devices. In fact, COTS components have already been used successfully by NASA and ESA in missions that requested such high performance that could not be satisfied by any available rad-hard component. In the recent years, the access to space is being facilitated in part due to the irruption of private companies in the space industry. Such companies do not always seek to avoid the risk at any cost, but they must pursue profitability, so they perform a trade-off between risk, cost, and schedule through risk management in a paradigm known as “New Space”. Private companies are often interested in deliver space-based services with the maximum performance and maximum benefit as possible. With such objective, rad-hard components are less attractive than COTS due to their higher cost and lower performance. However, COTS components have not been specifically designed to be used in space and typically they do not include specific techniques to avoid or mitigate the radiation effects in their operation. COTS components are commercialized “as is”, so it is not possible to modify them to improve their susceptibility to radiation effects. Moreover, the high levels of integration of complex, high-performance SoC devices hinder their observability and the application of fault-tolerance techniques. This problem is especially relevant in the case of microprocessors. Thus, there is a growing interest in the development of techniques allowing to understand and improve the behavior of COTS microprocessors under radiation without modifying their architecture and without interfering with their operation. Such techniques may facilitate the use of COTS components in space and maximize the performance of present and future space missions. In this Thesis, novel techniques have been developed to detect, diagnose, and mitigate radiation-induced errors in COTS microprocessors and SoCs using the trace interface as an observation point. The trace interface is a resource commonly found in modern microprocessors, mainly intended to support software development and debugging activities during the design phase. However, it is commonly left unused during the operational phase of the system, so it can be reused with no cost. The trace interface constitutes a feasible connection point to observe microprocessor behavior in a non-intrusive manner and without disturbing processor operation. As a result of this Thesis, an IP module has been developed capable to gather and decode the trace information of a modern, high-end, COTS microprocessor. The IP is highly configurable and customizable to support different applications and processor types. The IP has been designed and validated using the Xilinx Zynq-7000 device as a development platform, which is an interesting COTS device for the space industry. This device features a dual-core ARM Cortex-A9 processor, which is a good representative of modern, high-end, hard-core microprocessors. The resulting IP is compatible with the ARM CoreSight technology, which enables access to trace information in ARM microprocessors. The IP is able to detect errors in the execution flow of the microprocessor and in the application data using trace information, in real time and with very low latency. The IP has been validated in fault injection campaigns and also under proton and neutron irradiation campaigns in specialized facilities. It has also been combined with other fault-tolerance techniques to build hybrid error mitigation approaches. Experimental results demonstrate its high detection capabilities and high potential for the diagnosis of radiation-induced errors. The result of this Thesis, developed in the framework of an Industrial Ph.D. between the University Carlos III of Madrid (UC3M) and the company Arquimea, has been successfully transferred to the company business as a project sponsored by European Space Agency to continue its development and subsequent commercialization.Programa de Doctorado en Ingeniería Eléctrica, Electrónica y Automática por la Universidad Carlos III de MadridPresidenta: María Luisa López Vallejo.- Secretario: Enrique San Millán Heredia.- Vocal: Luigi Di Lill

    Low-cost error detection through high-level synthesis

    Get PDF
    System-on-chip design is becoming increasingly complex as technology scaling enables more and more functionality on a chip. This scaling and complexity has resulted in a variety of reliability and validation challenges including logic bugs, hot spots, wear-out, and soft errors. To make matters worse, as we reach the limits of Dennard scaling, efforts to improve system performance and energy efficiency have resulted in the integration of a wide variety of complex hardware accelerators in SoCs. Thus the challenge is to design complex, custom hardware that is efficient, but also correct and reliable. High-level synthesis shows promise to address the problem of complex hardware design by providing a bridge from the high-productivity software domain to the hardware design process. Much research has been done on high-level synthesis efficiency optimizations. This thesis shows that high-level synthesis also has the power to address validation and reliability challenges through two solutions. One solution for circuit reliability is modulo-3 shadow datapaths: performing lightweight shadow computations in modulo-3 space for each main computation. We leverage the binding and scheduling flexibility of high-level synthesis to detect control errors through diverse binding and minimize area cost through intelligent checkpoint scheduling and modulo-3 reducer sharing. We introduce logic and dataflow optimizations to further reduce cost. We evaluated our technique with 12 high-level synthesis benchmarks from the arithmetic-oriented PolyBench benchmark suite using FPGA emulated netlist-level error injection. We observe coverages of 99.1% for stuck-at faults, 99.5% for soft errors, and 99.6% for timing errors with a 25.7% area cost and negligible performance impact. Leveraging a mean error detection latency of 12.75 cycles (4150x faster than end result check) for soft errors, we also explore a rollback recovery method with an additional area cost of 28.0%, observing a 175x increase in reliability against soft errors. Another solution for rapid post-silicon validation of accelerator designs is Hybrid Quick Error Detection (H-QED): inserting signature generation logic in a hardware design to create a heavily compressed signature stream that captures the internal behavior of the design at a fine temporal and spatial granularity for comparison with a reference set of signatures generated by high-level simulation to detect bugs. Using H-QED, we demonstrate an improvement in error detection latency (time elapsed from when a bug is activated to when it manifests as an observable failure) of two orders of magnitude and a threefold improvement in bug coverage compared to traditional post-silicon validation techniques. H-QED also uncovered previously unknown bugs in the CHStone benchmark suite, which is widely used by the HLS community. H-QED incurs less than 10% area overhead for the accelerator it validates with negligible performance impact, and we also introduce techniques to minimize any possible intrusiveness introduced by H-QED
    corecore