894 research outputs found
DeSyRe: on-Demand System Reliability
The DeSyRe project builds on-demand adaptive and reliable Systems-on-Chips (SoCs). As fabrication technology scales down, chips are becoming less reliable, thereby incurring increased power and performance costs for fault tolerance. To make matters worse, power density is becoming a significant limiting factor in SoC design, in general. In the face of such changes in the technological landscape, current solutions for fault tolerance are expected to introduce excessive overheads in future systems. Moreover, attempting to design and manufacture a totally defect and fault-free system, would impact heavily, even prohibitively, the design, manufacturing, and testing costs, as well as the system performance and power consumption. In this context, DeSyRe delivers a new generation of systems that are reliable by design at well-balanced power, performance, and design costs. In our attempt to reduce the overheads of fault-tolerance, only a small fraction of the chip is built to be fault-free. This fault-free part is then employed to manage the remaining fault-prone resources of the SoC. The DeSyRe framework is applied to two medical systems with high safety requirements (measured using the IEC 61508 functional safety standard) and tight power and performance constraints
On the use of embedded debug features for permanent and transient fault resilience in microprocessors
Microprocessor-based systems are employed in an increasing number of applications where dependability is a major constraint. For this reason detecting faults arising during normal operation while introducing the least possible penalties is a main concern. Different forms of redundancy have been employed to ensure error-free behavior, while error detection mechanisms can be employed where some detection latency is tolerated. However, the high complexity and the low observability of microprocessors internal resources make the identification of adequate on-line error detection strategies a very challenging task, which can be tackled at circuit or system level. Concerning system-level strategies, a common limitation is in the mechanism used to monitor program execution and then detect errors as soon as possible, so as to reduce their impact on the application. In this work, an on-line error detection approach based on the reuse of available debugging infrastructures is proposed. The approach can be applied to different system architectures profiting from the debug trace port available in most of current microprocessors to observe possible misbehaviors. Two microprocessors have been used to study the applicability of the solution. LEON3 and ARM7TDMI. Results show that the presented fault detection technique enhances observability and thus error detection abilities in microprocessor-based systems without requiring modifications on the core architecture
Error Detection and Diagnosis for System-on-Chip in Space Applications
Tesis por compendio de publicacionesLos componentes electrónicos comerciales, comúnmente llamados componentes
Commercial-Off-The-Shelf (COTS) están presentes en multitud de dispositivos habituales
en nuestro día a día. Particularmente, el uso de microprocesadores y sistemas en chip (SoC)
altamente integrados ha favorecido la aparición de dispositivos electrónicos cada vez más
inteligentes que sostienen el estilo de vida y el avance de la sociedad moderna. Su uso se
ha generalizado incluso en aquellos sistemas que se consideran críticos para la seguridad,
como vehículos, aviones, armamento, dispositivos médicos, implantes o centrales eléctricas.
En cualquiera de ellos, un fallo podría tener graves consecuencias humanas o económicas.
Sin embargo, todos los sistemas electrónicos conviven constantemente con factores internos
y externos que pueden provocar fallos en su funcionamiento. La capacidad de un sistema
para funcionar correctamente en presencia de fallos se denomina tolerancia a fallos, y es
un requisito en el diseño y operación de sistemas críticos.
Los vehículos espaciales como satélites o naves espaciales también hacen uso de
microprocesadores para operar de forma autónoma o semi autónoma durante su vida útil,
con la dificultad añadida de que no pueden ser reparados en órbita, por lo que se consideran
sistemas críticos. Además, las duras condiciones existentes en el espacio, y en particular
los efectos de la radiación, suponen un gran desafío para el correcto funcionamiento de los
dispositivos electrónicos. Concretamente, los fallos transitorios provocados por radiación
(conocidos como soft errors) tienen el potencial de ser una de las mayores amenazas para
la fiabilidad de un sistema en el espacio.
Las misiones espaciales de gran envergadura, típicamente financiadas públicamente
como en el caso de la NASA o la Agencia Espacial Europea (ESA), han tenido
históricamente como requisito evitar el riesgo a toda costa por encima de cualquier
restricción de coste o plazo. Por ello, la selección de componentes resistentes a la radiación
(rad-hard) específicamente diseñados para su uso en el espacio ha sido la metodología
imperante en el paradigma que hoy podemos denominar industria espacial tradicional, u
Old Space. Sin embargo, los componentes rad-hard tienen habitualmente un coste mucho
más alto y unas prestaciones mucho menores que otros componentes COTS equivalentes.
De hecho, los componentes COTS ya han sido utilizados satisfactoriamente en misiones
de la NASA o la ESA cuando las prestaciones requeridas por la misión no podían ser
cubiertas por ningún componente rad-hard existente.
En los últimos años, el acceso al espacio se está facilitando debido en gran parte a la
entrada de empresas privadas en la industria espacial. Estas empresas no siempre buscan
evitar el riesgo a toda costa, sino que deben perseguir una rentabilidad económica, por
lo que hacen un balance entre riesgo, coste y plazo mediante gestión del riesgo en un
paradigma denominado Nuevo Espacio o New Space. Estas empresas a menudo están
interesadas en entregar servicios basados en el espacio con las máximas prestaciones y el mayor beneficio posibles, para lo cual los componentes rad-hard son menos atractivos
debido a su mayor coste y menores prestaciones que los componentes COTS existentes.
Sin embargo, los componentes COTS no han sido específicamente diseñados para su uso
en el espacio y típicamente no incluyen técnicas específicas para evitar que los efectos de
la radiación afecten su funcionamiento. Los componentes COTS se comercializan tal cual
son, y habitualmente no es posible modificarlos para mejorar su resistencia a la radiación.
Además, los elevados niveles de integración de los sistemas en chip (SoC) complejos
de altas prestaciones dificultan su observación y la aplicación de técnicas de tolerancia
a fallos. Este problema es especialmente relevante en el caso de los microprocesadores.
Por tanto, existe un gran interés en el desarrollo de técnicas que permitan conocer y
mejorar el comportamiento de los microprocesadores COTS bajo radiación sin modificar
su arquitectura y sin interferir en su funcionamiento para facilitar su uso en el espacio y
con ello maximizar las prestaciones de las misiones espaciales presentes y futuras.
En esta Tesis se han desarrollado técnicas novedosas para detectar, diagnosticar y
mitigar los errores producidos por radiación en microprocesadores y sistemas en chip
(SoC) comerciales, utilizando la interfaz de traza como punto de observación. La interfaz de
traza es un recurso habitual en los microprocesadores modernos, principalmente enfocado
a soportar las tareas de desarrollo y depuración del software durante la fase de diseño. Sin
embargo, una vez el desarrollo ha concluido, la interfaz de traza típicamente no se utiliza
durante la fase operativa del sistema, por lo que puede ser reutilizada sin coste. La interfaz
de traza constituye un punto de conexión viable para observar el comportamiento de un
microprocesador de forma no intrusiva y sin interferir en su funcionamiento.
Como resultado de esta Tesis se ha desarrollado un módulo IP capaz de recabar
y decodificar la información de traza de un microprocesador COTS moderno de altas
prestaciones. El IP es altamente configurable y personalizable para adaptarse a diferentes
aplicaciones y tipos de procesadores. Ha sido diseñado y validado utilizando el dispositivo
Zynq-7000 de Xilinx como plataforma de desarrollo, que constituye un dispositivo COTS
de interés en la industria espacial. Este dispositivo incluye un procesador ARM Cortex-A9
de doble núcleo, que es representativo del conjunto de microprocesadores hard-core
modernos de altas prestaciones. El IP resultante es compatible con la tecnología ARM
CoreSight, que proporciona acceso a información de traza en los microprocesadores ARM.
El IP incorpora técnicas para detectar errores en el flujo de ejecución y en los datos de la
aplicación ejecutada utilizando la información de traza, en tiempo real y con muy baja
latencia. El IP se ha validado en campañas de inyección de fallos y también en radiación con
protones y neutrones en instalaciones especializadas. También se ha combinado con otras
técnicas de tolerancia a fallos para construir técnicas híbridas de mitigación de errores.
Los resultados experimentales obtenidos demuestran su alta capacidad de detección y
potencialidad en el diagnóstico de errores producidos por radiación.
El resultado de esta Tesis, desarrollada en el marco de un Doctorado Industrial entre
la Universidad Carlos III de Madrid (UC3M) y la empresa Arquimea, se ha transferido satisfactoriamente al entorno empresarial en forma de un proyecto financiado por la
Agencia Espacial Europea para continuar su desarrollo y posterior explotación.Commercial electronic components, also known as Commercial-Off-The-Shelf (COTS),
are present in a wide variety of devices commonly used in our daily life. Particularly, the
use of microprocessors and highly integrated System-on-Chip (SoC) devices has fostered
the advent of increasingly intelligent electronic devices which sustain the lifestyles and the
progress of modern society. Microprocessors are present even in safety-critical systems,
such as vehicles, planes, weapons, medical devices, implants, or power plants. In any of
these cases, a fault could involve severe human or economic consequences. However, every
electronic system deals continuously with internal and external factors that could provoke
faults in its operation. The capacity of a system to operate correctly in presence of faults
is known as fault-tolerance, and it becomes a requirement in the design and operation of
critical systems.
Space vehicles such as satellites or spacecraft also incorporate microprocessors to
operate autonomously or semi-autonomously during their service life, with the additional
difficulty that they cannot be repaired once in-orbit, so they are considered critical systems.
In addition, the harsh conditions in space, and specifically radiation effects, involve a big
challenge for the correct operation of electronic devices. In particular, radiation-induced
soft errors have the potential to become one of the major risks for the reliability of systems
in space.
Large space missions, typically publicly funded as in the case of NASA or European
Space Agency (ESA), have followed historically the requirement to avoid the risk at any
expense, regardless of any cost or schedule restriction. Because of that, the selection of
radiation-resistant components (known as rad-hard) specifically designed to be used in
space has been the dominant methodology in the paradigm of traditional space industry,
also known as “Old Space”. However, rad-hard components have commonly a much higher
associated cost and much lower performance that other equivalent COTS devices. In fact,
COTS components have already been used successfully by NASA and ESA in missions
that requested such high performance that could not be satisfied by any available rad-hard
component.
In the recent years, the access to space is being facilitated in part due to the irruption
of private companies in the space industry. Such companies do not always seek to avoid
the risk at any cost, but they must pursue profitability, so they perform a trade-off between
risk, cost, and schedule through risk management in a paradigm known as “New Space”.
Private companies are often interested in deliver space-based services with the maximum
performance and maximum benefit as possible. With such objective, rad-hard components
are less attractive than COTS due to their higher cost and lower performance.
However, COTS components have not been specifically designed to be used in space
and typically they do not include specific techniques to avoid or mitigate the radiation effects in their operation. COTS components are commercialized “as is”, so it is not
possible to modify them to improve their susceptibility to radiation effects. Moreover,
the high levels of integration of complex, high-performance SoC devices hinder their
observability and the application of fault-tolerance techniques. This problem is especially
relevant in the case of microprocessors. Thus, there is a growing interest in the development
of techniques allowing to understand and improve the behavior of COTS microprocessors
under radiation without modifying their architecture and without interfering with their
operation. Such techniques may facilitate the use of COTS components in space and
maximize the performance of present and future space missions.
In this Thesis, novel techniques have been developed to detect, diagnose, and
mitigate radiation-induced errors in COTS microprocessors and SoCs using the trace
interface as an observation point. The trace interface is a resource commonly found
in modern microprocessors, mainly intended to support software development and
debugging activities during the design phase. However, it is commonly left unused
during the operational phase of the system, so it can be reused with no cost. The trace
interface constitutes a feasible connection point to observe microprocessor behavior in a
non-intrusive manner and without disturbing processor operation.
As a result of this Thesis, an IP module has been developed capable to gather and
decode the trace information of a modern, high-end, COTS microprocessor. The IP is highly
configurable and customizable to support different applications and processor types. The
IP has been designed and validated using the Xilinx Zynq-7000 device as a development
platform, which is an interesting COTS device for the space industry. This device features a
dual-core ARM Cortex-A9 processor, which is a good representative of modern, high-end,
hard-core microprocessors. The resulting IP is compatible with the ARM CoreSight
technology, which enables access to trace information in ARM microprocessors. The IP is
able to detect errors in the execution flow of the microprocessor and in the application data
using trace information, in real time and with very low latency. The IP has been validated
in fault injection campaigns and also under proton and neutron irradiation campaigns in
specialized facilities. It has also been combined with other fault-tolerance techniques
to build hybrid error mitigation approaches. Experimental results demonstrate its high
detection capabilities and high potential for the diagnosis of radiation-induced errors.
The result of this Thesis, developed in the framework of an Industrial Ph.D. between the
University Carlos III of Madrid (UC3M) and the company Arquimea, has been successfully
transferred to the company business as a project sponsored by European Space Agency to
continue its development and subsequent commercialization.Programa de Doctorado en Ingeniería Eléctrica, Electrónica y Automática por la Universidad Carlos III de MadridPresidenta: María Luisa López Vallejo.- Secretario: Enrique San Millán Heredia.- Vocal: Luigi Di Lill
A Survey of Recent Developments in Testability, Safety and Security of RISC-V Processors
With the continued success of the open RISC-V architecture, practical deployment of RISC-V processors necessitates an in-depth consideration of their testability, safety and security aspects. This survey provides an overview of recent developments in this quickly-evolving field. We start with discussing the application of state-of-the-art functional and system-level test solutions to RISC-V processors. Then, we discuss the use of RISC-V processors for safety-related applications; to this end, we outline the essential techniques necessary to obtain safety both in the functional and in the timing domain and review recent processor designs with safety features. Finally, we survey the different aspects of security with respect to RISC-V implementations and discuss the relationship between cryptographic protocols and primitives on the one hand and the RISC-V processor architecture and hardware implementation on the other. We also comment on the role of a RISC-V processor for system security and its resilience against side-channel attacks
Toward Fault-Tolerant Applications on Reconfigurable Systems-on-Chip
L'abstract è presente nell'allegato / the abstract is in the attachmen
Real-Time Trace Decoding and Monitoring for Safety and Security in Embedded Systems
Integrated circuits and systems can be found almost everywhere in today’s world. As their use increases, they need to be made safer and more perfor mant to meet current demands in processing power. FPGA integrated SoCs can provide the ideal trade-off between performance, adaptability, and energy usage. One of today’s vital challenges lies in updating existing fault tolerance techniques for these new systems while utilizing all available processing capa bilities, such as multi-core and heterogeneous processing units. Control-flow monitoring is one of the primary mechanisms described for error detection at the software architectural level for the highest grade of hazard level clas sifications (e.g., ASIL D) described in industry safety standards ISO-26262. Control-flow errors are also known to compose the majority of detected errors for ICs and embedded systems in safety-critical and risk-susceptible environ ments [5]. Software-based monitoring methods remain the most popular [6–8]. However, recent studies show that the overheads they impose make actual reliability gains negligible [9, 10]. This work proposes and demonstrates a new control flow checking method implemented in FPGA for multi-core embedded systems called control-flow trace checker (CFTC). CFTC uses existing trace and debug subsystems of modern processors to rebuild their execution states. It can iden tify any errors in real-time by comparing executed states to a set of permitted state transitions determined statically. This novel implementation weighs hardware resource trade-offs to target mul tiple independent tasks in multi-core embedded applications, as well as single core systems. The proposed system is entirely implemented in hardware and isolated from all monitored software components, requiring 2.4% of the target FPGA platform resources to protect an execution unit in its entirety. There fore, it avoids undesired overheads and maintains deterministic error detection latencies, which guarantees reliability improvements without impairing the target software system. Finally, CFTC is evaluated under different software i Resumo fault-injection scenarios, achieving detection rates of 100% of all control-flow errors to wrong destinations and 98% of all injected faults to program binaries. All detection times are further analyzed and precisely described by a model based on the monitor’s resources and speed and the software application’s control-flow structure and binary characteristics.Circuitos integrados estão presentes em quase todos sistemas complexos do mundo moderno. Conforme sua frequência de uso aumenta, eles precisam se tornar mais seguros e performantes para conseguir atender as novas demandas em potência de processamento. Sistemas em Chip integrados com FPGAs conseguem prover o balanço perfeito entre desempenho, adaptabilidade, e uso de energia. Um dos maiores desafios agora é a necessidade de atualizar técnicas de tolerância à falhas para estes novos sistemas, aproveitando os novos avanços em capacidade de processamento. Monitoramento de fluxo de controle é um dos principais mecanismos para a detecção de erros em nível de software para sistemas classificados como de alto risco (e.g. ASIL D), descrito em padrões de segurança como o ISO-26262. Estes erros são conhecidos por compor a maioria dos erros detectados em sistemas integrados [5]. Embora métodos de monitoramento baseados em software continuem sendo os mais populares [6–8], estudos recentes mostram que seus custos adicionais, em termos de performance e área, diminuem consideravelmente seus ganhos reais em confiabilidade [9, 10]. Propomos aqui um novo método de monitora mento de fluxo de controle implementado em FPGA para sistemas embarcados multi-core. Este método usa subsistemas de trace e execução de código para reconstruir o estado atual do processador, identificando erros através de com parações entre diferentes estados de execução da CPU. Propomos uma implementação que considera trade-offs no uso de recuros de sistema para monitorar múltiplas tarefas independetes. Nossa abordagem suporta o monitoramento de sistemas simples e também de sistemas multi-core multitarefa. Por fim, nossa técnica é totalmente implementada em hardware, evitando o uso de unidades de processamento de software que possa adicionar custos indesejáveis à aplicação em perda de confiabilidade. Propomos, assim, um mecanismo de verificação de fluxo de controle, escalável e extensível, para proteção de sistemas embarcados críticos e multi-core
- …