1,560 research outputs found
DeSyRe: on-Demand System Reliability
The DeSyRe project builds on-demand adaptive and reliable Systems-on-Chips (SoCs). As fabrication technology scales down, chips are becoming less reliable, thereby incurring increased power and performance costs for fault tolerance. To make matters worse, power density is becoming a significant limiting factor in SoC design, in general. In the face of such changes in the technological landscape, current solutions for fault tolerance are expected to introduce excessive overheads in future systems. Moreover, attempting to design and manufacture a totally defect and fault-free system, would impact heavily, even prohibitively, the design, manufacturing, and testing costs, as well as the system performance and power consumption. In this context, DeSyRe delivers a new generation of systems that are reliable by design at well-balanced power, performance, and design costs. In our attempt to reduce the overheads of fault-tolerance, only a small fraction of the chip is built to be fault-free. This fault-free part is then employed to manage the remaining fault-prone resources of the SoC. The DeSyRe framework is applied to two medical systems with high safety requirements (measured using the IEC 61508 functional safety standard) and tight power and performance constraints
Timing speculation and adaptive reliable overclocking techniques for aggressive computer systems
Computers have changed our lives beyond our own imagination in the past several decades. The continued and progressive advancements in VLSI technology and numerous micro-architectural innovations have played a key role in the design of spectacular low-cost high performance computing systems that have become omnipresent in today\u27s technology driven world. Performance and dependability have become key concerns as these ubiquitous computing machines continue to drive our everyday life. Every application has unique demands, as they run in diverse operating environments. Dependable, aggressive and adaptive systems improve efficiency in terms of speed, reliability and energy consumption.
Traditional computing systems run at a fixed clock frequency, which is determined by taking into account the worst-case timing paths, operating conditions, and process variations. Timing speculation based reliable overclocking advocates going beyond worst-case limits to achieve best performance while not avoiding, but detecting and correcting a modest number of timing errors. The success of this design methodology relies on the fact that timing critical paths are rarely exercised in a design, and typical execution happens much faster than the timing requirements dictated by worst-case design methodology. Better-than-worst-case design methodology is advocated by several recent research pursuits, which exploit dependability techniques to enhance computer system performance.
In this dissertation, we address different aspects of timing speculation based adaptive reliable overclocking schemes, and evaluate their role in the design of low-cost, high performance, energy efficient and dependable systems. We visualize various control knobs in the design that can be favorably controlled to ensure different design targets.
As part of this research, we extend the SPRIT3E, or Superscalar PeRformance Improvement Through Tolerating Timing Errors, framework, and characterize the extent of application dependent performance acceleration achievable in superscalar processors by scrutinizing the various parameters that impact the operation beyond worst-case limits. We study the limitations imposed by short-path constraints on our technique, and present ways to exploit them to maximize performance gains. We analyze the sensitivity of our technique\u27s adaptiveness by exploring the necessary hardware requirements for dynamic overclocking schemes. Experimental analysis based on SPEC2000 benchmarks running on a SimpleScalar Alpha processor simulator, augmented with error rate data obtained from hardware simulations of a superscalar processor, are presented.
Even though reliable overclocking guarantees functional correctness, it leads to higher power consumption. As a consequence, reliable overclocking without considering on-chip temperatures will bring down the lifetime reliability of the chip. In this thesis, we analyze how reliable overclocking impacts the on-chip temperature of a microprocessor and evaluate the effects of overheating, due to such reliable dynamic frequency tuning mechanisms, on the lifetime reliability of these systems. We then evaluate the effect of performing thermal throttling, a technique that clamps the on-chip temperature below a predefined value, on system performance and reliability. Our study shows that a reliably overclocked system with dynamic thermal management achieves 25% performance improvement, while lasting for 14 years when being operated within 353K.
Over the past five decades, technology scaling, as predicted by Moore\u27s law, has been the bedrock of semiconductor technology evolution. The continued downscaling of CMOS technology to deep sub-micron gate lengths has been the primary reason for its dominance in today\u27s omnipresent silicon microchips. Even as the transition to the next technology node is indispensable, the initial cost and time associated in doing so presents a non-level playing field for the competitors in the semiconductor business. As part of this thesis, we evaluate the capability of speculative reliable overclocking mechanisms to maximize performance at a given technology level. We evaluate its competitiveness when compared to technology scaling, in terms of performance, power consumption, energy and energy delay product. We present a comprehensive comparison for integer and floating point SPEC2000 benchmarks running on a simulated Alpha processor at three different technology nodes in normal and enhanced modes. Our results suggest that adopting reliable overclocking strategies will help skip a technology node altogether, or be competitive in the market, while porting to the next technology node.
Reliability has become a serious concern as systems embrace nanometer technologies. In this dissertation, we propose a novel fault tolerant aggressive system that combines soft error protection and timing error tolerance. We replicate both the pipeline registers and the pipeline stage combinational logic. The replicated logic receives its inputs from the primary pipeline registers while writing its output to the replicated pipeline registers. The organization of redundancy in the proposed Conjoined Pipeline system supports overclocking, provides concurrent error detection and recovery capability for soft errors, intermittent faults and timing errors, and flags permanent silicon defects. The fast recovery process requires no checkpointing and takes three cycles. Back annotated post-layout gate-level timing simulations, using 45nm technology, of a conjoined two-stage arithmetic pipeline and a conjoined five-stage DLX pipeline processor, with forwarding logic, show that our approach, even under a severe fault injection campaign, achieves near 100% fault coverage and an average performance improvement of about 20%, when dynamically overclocked
Design of a fault tolerant airborne digital computer. Volume 1: Architecture
This volume is concerned with the architecture of a fault tolerant digital computer for an advanced commercial aircraft. All of the computations of the aircraft, including those presently carried out by analogue techniques, are to be carried out in this digital computer. Among the important qualities of the computer are the following: (1) The capacity is to be matched to the aircraft environment. (2) The reliability is to be selectively matched to the criticality and deadline requirements of each of the computations. (3) The system is to be readily expandable. contractible, and (4) The design is to appropriate to post 1975 technology. Three candidate architectures are discussed and assessed in terms of the above qualities. Of the three candidates, a newly conceived architecture, Software Implemented Fault Tolerance (SIFT), provides the best match to the above qualities. In addition SIFT is particularly simple and believable. The other candidates, Bus Checker System (BUCS), also newly conceived in this project, and the Hopkins multiprocessor are potentially more efficient than SIFT in the use of redundancy, but otherwise are not as attractive
Self-Test Mechanisms for Automotive Multi-Processor System-on-Chips
L'abstract è presente nell'allegato / the abstract is in the attachmen
Error Detection and Diagnosis for System-on-Chip in Space Applications
Tesis por compendio de publicacionesLos componentes electrónicos comerciales, comúnmente llamados componentes
Commercial-Off-The-Shelf (COTS) están presentes en multitud de dispositivos habituales
en nuestro día a día. Particularmente, el uso de microprocesadores y sistemas en chip (SoC)
altamente integrados ha favorecido la aparición de dispositivos electrónicos cada vez más
inteligentes que sostienen el estilo de vida y el avance de la sociedad moderna. Su uso se
ha generalizado incluso en aquellos sistemas que se consideran críticos para la seguridad,
como vehículos, aviones, armamento, dispositivos médicos, implantes o centrales eléctricas.
En cualquiera de ellos, un fallo podría tener graves consecuencias humanas o económicas.
Sin embargo, todos los sistemas electrónicos conviven constantemente con factores internos
y externos que pueden provocar fallos en su funcionamiento. La capacidad de un sistema
para funcionar correctamente en presencia de fallos se denomina tolerancia a fallos, y es
un requisito en el diseño y operación de sistemas críticos.
Los vehículos espaciales como satélites o naves espaciales también hacen uso de
microprocesadores para operar de forma autónoma o semi autónoma durante su vida útil,
con la dificultad añadida de que no pueden ser reparados en órbita, por lo que se consideran
sistemas críticos. Además, las duras condiciones existentes en el espacio, y en particular
los efectos de la radiación, suponen un gran desafío para el correcto funcionamiento de los
dispositivos electrónicos. Concretamente, los fallos transitorios provocados por radiación
(conocidos como soft errors) tienen el potencial de ser una de las mayores amenazas para
la fiabilidad de un sistema en el espacio.
Las misiones espaciales de gran envergadura, típicamente financiadas públicamente
como en el caso de la NASA o la Agencia Espacial Europea (ESA), han tenido
históricamente como requisito evitar el riesgo a toda costa por encima de cualquier
restricción de coste o plazo. Por ello, la selección de componentes resistentes a la radiación
(rad-hard) específicamente diseñados para su uso en el espacio ha sido la metodología
imperante en el paradigma que hoy podemos denominar industria espacial tradicional, u
Old Space. Sin embargo, los componentes rad-hard tienen habitualmente un coste mucho
más alto y unas prestaciones mucho menores que otros componentes COTS equivalentes.
De hecho, los componentes COTS ya han sido utilizados satisfactoriamente en misiones
de la NASA o la ESA cuando las prestaciones requeridas por la misión no podían ser
cubiertas por ningún componente rad-hard existente.
En los últimos años, el acceso al espacio se está facilitando debido en gran parte a la
entrada de empresas privadas en la industria espacial. Estas empresas no siempre buscan
evitar el riesgo a toda costa, sino que deben perseguir una rentabilidad económica, por
lo que hacen un balance entre riesgo, coste y plazo mediante gestión del riesgo en un
paradigma denominado Nuevo Espacio o New Space. Estas empresas a menudo están
interesadas en entregar servicios basados en el espacio con las máximas prestaciones y el mayor beneficio posibles, para lo cual los componentes rad-hard son menos atractivos
debido a su mayor coste y menores prestaciones que los componentes COTS existentes.
Sin embargo, los componentes COTS no han sido específicamente diseñados para su uso
en el espacio y típicamente no incluyen técnicas específicas para evitar que los efectos de
la radiación afecten su funcionamiento. Los componentes COTS se comercializan tal cual
son, y habitualmente no es posible modificarlos para mejorar su resistencia a la radiación.
Además, los elevados niveles de integración de los sistemas en chip (SoC) complejos
de altas prestaciones dificultan su observación y la aplicación de técnicas de tolerancia
a fallos. Este problema es especialmente relevante en el caso de los microprocesadores.
Por tanto, existe un gran interés en el desarrollo de técnicas que permitan conocer y
mejorar el comportamiento de los microprocesadores COTS bajo radiación sin modificar
su arquitectura y sin interferir en su funcionamiento para facilitar su uso en el espacio y
con ello maximizar las prestaciones de las misiones espaciales presentes y futuras.
En esta Tesis se han desarrollado técnicas novedosas para detectar, diagnosticar y
mitigar los errores producidos por radiación en microprocesadores y sistemas en chip
(SoC) comerciales, utilizando la interfaz de traza como punto de observación. La interfaz de
traza es un recurso habitual en los microprocesadores modernos, principalmente enfocado
a soportar las tareas de desarrollo y depuración del software durante la fase de diseño. Sin
embargo, una vez el desarrollo ha concluido, la interfaz de traza típicamente no se utiliza
durante la fase operativa del sistema, por lo que puede ser reutilizada sin coste. La interfaz
de traza constituye un punto de conexión viable para observar el comportamiento de un
microprocesador de forma no intrusiva y sin interferir en su funcionamiento.
Como resultado de esta Tesis se ha desarrollado un módulo IP capaz de recabar
y decodificar la información de traza de un microprocesador COTS moderno de altas
prestaciones. El IP es altamente configurable y personalizable para adaptarse a diferentes
aplicaciones y tipos de procesadores. Ha sido diseñado y validado utilizando el dispositivo
Zynq-7000 de Xilinx como plataforma de desarrollo, que constituye un dispositivo COTS
de interés en la industria espacial. Este dispositivo incluye un procesador ARM Cortex-A9
de doble núcleo, que es representativo del conjunto de microprocesadores hard-core
modernos de altas prestaciones. El IP resultante es compatible con la tecnología ARM
CoreSight, que proporciona acceso a información de traza en los microprocesadores ARM.
El IP incorpora técnicas para detectar errores en el flujo de ejecución y en los datos de la
aplicación ejecutada utilizando la información de traza, en tiempo real y con muy baja
latencia. El IP se ha validado en campañas de inyección de fallos y también en radiación con
protones y neutrones en instalaciones especializadas. También se ha combinado con otras
técnicas de tolerancia a fallos para construir técnicas híbridas de mitigación de errores.
Los resultados experimentales obtenidos demuestran su alta capacidad de detección y
potencialidad en el diagnóstico de errores producidos por radiación.
El resultado de esta Tesis, desarrollada en el marco de un Doctorado Industrial entre
la Universidad Carlos III de Madrid (UC3M) y la empresa Arquimea, se ha transferido satisfactoriamente al entorno empresarial en forma de un proyecto financiado por la
Agencia Espacial Europea para continuar su desarrollo y posterior explotación.Commercial electronic components, also known as Commercial-Off-The-Shelf (COTS),
are present in a wide variety of devices commonly used in our daily life. Particularly, the
use of microprocessors and highly integrated System-on-Chip (SoC) devices has fostered
the advent of increasingly intelligent electronic devices which sustain the lifestyles and the
progress of modern society. Microprocessors are present even in safety-critical systems,
such as vehicles, planes, weapons, medical devices, implants, or power plants. In any of
these cases, a fault could involve severe human or economic consequences. However, every
electronic system deals continuously with internal and external factors that could provoke
faults in its operation. The capacity of a system to operate correctly in presence of faults
is known as fault-tolerance, and it becomes a requirement in the design and operation of
critical systems.
Space vehicles such as satellites or spacecraft also incorporate microprocessors to
operate autonomously or semi-autonomously during their service life, with the additional
difficulty that they cannot be repaired once in-orbit, so they are considered critical systems.
In addition, the harsh conditions in space, and specifically radiation effects, involve a big
challenge for the correct operation of electronic devices. In particular, radiation-induced
soft errors have the potential to become one of the major risks for the reliability of systems
in space.
Large space missions, typically publicly funded as in the case of NASA or European
Space Agency (ESA), have followed historically the requirement to avoid the risk at any
expense, regardless of any cost or schedule restriction. Because of that, the selection of
radiation-resistant components (known as rad-hard) specifically designed to be used in
space has been the dominant methodology in the paradigm of traditional space industry,
also known as “Old Space”. However, rad-hard components have commonly a much higher
associated cost and much lower performance that other equivalent COTS devices. In fact,
COTS components have already been used successfully by NASA and ESA in missions
that requested such high performance that could not be satisfied by any available rad-hard
component.
In the recent years, the access to space is being facilitated in part due to the irruption
of private companies in the space industry. Such companies do not always seek to avoid
the risk at any cost, but they must pursue profitability, so they perform a trade-off between
risk, cost, and schedule through risk management in a paradigm known as “New Space”.
Private companies are often interested in deliver space-based services with the maximum
performance and maximum benefit as possible. With such objective, rad-hard components
are less attractive than COTS due to their higher cost and lower performance.
However, COTS components have not been specifically designed to be used in space
and typically they do not include specific techniques to avoid or mitigate the radiation effects in their operation. COTS components are commercialized “as is”, so it is not
possible to modify them to improve their susceptibility to radiation effects. Moreover,
the high levels of integration of complex, high-performance SoC devices hinder their
observability and the application of fault-tolerance techniques. This problem is especially
relevant in the case of microprocessors. Thus, there is a growing interest in the development
of techniques allowing to understand and improve the behavior of COTS microprocessors
under radiation without modifying their architecture and without interfering with their
operation. Such techniques may facilitate the use of COTS components in space and
maximize the performance of present and future space missions.
In this Thesis, novel techniques have been developed to detect, diagnose, and
mitigate radiation-induced errors in COTS microprocessors and SoCs using the trace
interface as an observation point. The trace interface is a resource commonly found
in modern microprocessors, mainly intended to support software development and
debugging activities during the design phase. However, it is commonly left unused
during the operational phase of the system, so it can be reused with no cost. The trace
interface constitutes a feasible connection point to observe microprocessor behavior in a
non-intrusive manner and without disturbing processor operation.
As a result of this Thesis, an IP module has been developed capable to gather and
decode the trace information of a modern, high-end, COTS microprocessor. The IP is highly
configurable and customizable to support different applications and processor types. The
IP has been designed and validated using the Xilinx Zynq-7000 device as a development
platform, which is an interesting COTS device for the space industry. This device features a
dual-core ARM Cortex-A9 processor, which is a good representative of modern, high-end,
hard-core microprocessors. The resulting IP is compatible with the ARM CoreSight
technology, which enables access to trace information in ARM microprocessors. The IP is
able to detect errors in the execution flow of the microprocessor and in the application data
using trace information, in real time and with very low latency. The IP has been validated
in fault injection campaigns and also under proton and neutron irradiation campaigns in
specialized facilities. It has also been combined with other fault-tolerance techniques
to build hybrid error mitigation approaches. Experimental results demonstrate its high
detection capabilities and high potential for the diagnosis of radiation-induced errors.
The result of this Thesis, developed in the framework of an Industrial Ph.D. between the
University Carlos III of Madrid (UC3M) and the company Arquimea, has been successfully
transferred to the company business as a project sponsored by European Space Agency to
continue its development and subsequent commercialization.Programa de Doctorado en Ingeniería Eléctrica, Electrónica y Automática por la Universidad Carlos III de MadridPresidenta: María Luisa López Vallejo.- Secretario: Enrique San Millán Heredia.- Vocal: Luigi Di Lill
Reliability-aware and energy-efficient system level design for networks-on-chip
2015 Spring.Includes bibliographical references.With CMOS technology aggressively scaling into the ultra-deep sub-micron (UDSM) regime and application complexity growing rapidly in recent years, processors today are being driven to integrate multiple cores on a chip. Such chip multiprocessor (CMP) architectures offer unprecedented levels of computing performance for highly parallel emerging applications in the era of digital convergence. However, a major challenge facing the designers of these emerging multicore architectures is the increased likelihood of failure due to the rise in transient, permanent, and intermittent faults caused by a variety of factors that are becoming more and more prevalent with technology scaling. On-chip interconnect architectures are particularly susceptible to faults that can corrupt transmitted data or prevent it from reaching its destination. Reliability concerns in UDSM nodes have in part contributed to the shift from traditional bus-based communication fabrics to network-on-chip (NoC) architectures that provide better scalability, performance, and utilization than buses. In this thesis, to overcome potential faults in NoCs, my research began by exploring fault-tolerant routing algorithms. Under the constraint of deadlock freedom, we make use of the inherent redundancy in NoCs due to multiple paths between packet sources and sinks and propose different fault-tolerant routing schemes to achieve much better fault tolerance capabilities than possible with traditional routing schemes. The proposed schemes also use replication opportunistically to optimize the balance between energy overhead and arrival rate. As 3D integrated circuit (3D-IC) technology with wafer-to-wafer bonding has been recently proposed as a promising candidate for future CMPs, we also propose a fault-tolerant routing scheme for 3D NoCs which outperforms the existing popular routing schemes in terms of energy consumption, performance and reliability. To quantify reliability and provide different levels of intelligent protection, for the first time, we propose the network vulnerability factor (NVF) metric to characterize the vulnerability of NoC components to faults. NVF determines the probabilities that faults in NoC components manifest as errors in the final program output of the CMP system. With NVF aware partial protection for NoC components, almost 50% energy cost can be saved compared to the traditional approach of comprehensively protecting all NoC components. Lastly, we focus on the problem of fault-tolerant NoC design, that involves many NP-hard sub-problems such as core mapping, fault-tolerant routing, and fault-tolerant router configuration. We propose a novel design-time (RESYN) and a hybrid design and runtime (HEFT) synthesis framework to trade-off energy consumption and reliability in the NoC fabric at the system level for CMPs. Together, our research in fault-tolerant NoC routing, reliability modeling, and reliability aware NoC synthesis substantially enhances NoC reliability and energy-efficiency beyond what is possible with traditional approaches and state-of-the-art strategies from prior work
Fault Tolerant Electronic System Design
Due to technology scaling, which means reduced transistor size, higher density, lower voltage and more aggressive clock frequency, VLSI devices may become more sensitive against soft errors. Especially for those devices used in safety- and mission-critical applications, dependability and reliability are becoming increasingly important constraints during the development of system on/around them. Other phenomena (e.g., aging and wear-out effects) also have negative impacts on reliability of modern circuits. Recent researches show that even at sea level, radiation particles can still induce soft errors in electronic systems.
On one hand, processor-based system are commonly used in a wide variety of applications, including safety-critical and high availability missions, e.g., in the automotive, biomedical and aerospace domains. In these fields, an error may produce catastrophic consequences. Thus, dependability is a primary target that must be achieved taking into account tight constraints in terms of cost, performance, power and time to market. With standards and regulations (e.g., ISO-26262, DO-254, IEC-61508) clearly specify the targets to be achieved and the methods to prove their achievement, techniques working at system level are particularly attracting.
On the other hand, Field Programmable Gate Array (FPGA) devices are becoming more and more attractive, also in safety- and mission-critical applications due to the high performance, low power consumption and the flexibility for reconfiguration they provide. Two types of FPGAs are commonly used, based on their configuration memory cell technology, i.e., SRAM-based and Flash-based FPGA. For SRAM-based FPGAs, the SRAM cells of the configuration memory highly susceptible to radiation induced effects which can leads to system failure; and for Flash-based FPGAs, even though their non-volatile configuration memory cells are almost immune to Single Event Upsets induced by energetic particles, the floating gate switches and the logic cells in the configuration tiles can still suffer from Single Event Effects when hit by an highly charged particle. So analysis and mitigation techniques for Single Event Effects on FPGAs are becoming increasingly important in the design flow especially when reliability is one of the main requirements
Design and evaluation of buffered triple modular redundancy in interleaved-multi-threading processors
Fault management in digital chips is a crucial aspect of functional safety. Significant work has been done on gate and microarchitecture level triple modular redundancy, and on functional redundancy in multi-core and simultaneous-multi-threading processors, whereas little has been done to quantify the fault tolerance potential of interleaved-multi-threading. In this study, we apply the temporal-spatial triple modular redundancy concept to interleaved-multi-threading processors through a design solution that we call Buffered triple modular redundancy, using the soft-core Klessydra-T03 as the basis for our experiments. We then illustrate the quantitative findings of a large fault-injection simulation campaign on the fault-tolerant core and discuss the vulnerability comparison with previous representative fault-tolerant designs. The results show that the obtained resilience is comparable to a full triple modular redundancy at the cost of execution cycle count overhead instead of hardware overhead, yet with higher achievable clock frequency
- …