1,698 research outputs found
Fault Secure Encoder and Decoder for NanoMemory Applications
Memory cells have been protected from soft errors for more than a decade; due to the increase in soft error rate in logic circuits, the encoder and decoder circuitry around the memory blocks have become susceptible to soft errors as well and must also be protected. We introduce a new approach to design fault-secure encoder and decoder circuitry for memory designs. The key novel contribution of this paper is identifying and defining a new class of error-correcting codes whose redundancy makes the design of fault-secure detectors (FSD) particularly simple. We further quantify the importance of protecting encoder and decoder circuitry against transient errors, illustrating a scenario where the system failure rate (FIT) is dominated by the failure rate of the encoder and decoder. We prove that Euclidean geometry low-density parity-check (EG-LDPC) codes have the fault-secure detector capability. Using some of the smaller EG-LDPC codes, we can tolerate bit or nanowire defect rates of 10% and fault rates of 10^(-18) upsets/device/cycle, achieving a FIT rate at or below one for the entire memory system and a memory density of 10^(11) bit/cm^2 with nanowire pitch of 10 nm for memory blocks of 10 Mb or larger. Larger EG-LDPC codes can achieve even higher reliability and lower area overhead
Novel fault tolerant Multi-Bit Upset (MBU) Error-Detection and Correction (EDAC) architecture
Desde el punto de vista de seguridad, la certificación aeronáutica de
aplicaciones críticas de vuelo requiere diferentes técnicas que son usadas
para prevenir fallos en los equipos electrónicos. Los fallos de tipo hardware
debido a la radiación solar que existe a las alturas standard de vuelo, como
SEU (Single Event Upset) y MCU (Multiple Bit Upset), provocan un cambio
de estado de los bits que soportan la información almacenada en memoria.
Estos fallos se producen, por ejemplo, en la memoria de configuración de
una FPGA, que es donde se definen todas las funcionalidades. Las técnicas
de protección requieren normalmente de redundancias que incrementan el
coste, número de componentes, tamaño de la memoria y peso.
En la fase de desarrollo de aplicaciones críticas de vuelo, generalmente
se utilizan una serie de estándares o recomendaciones de diseño como
ABD100, RTCA DO-160, IEC62395, etc, y diferentes técnicas de protección
para evitar fallos del tipo SEU o MCU. Estas técnicas están basadas en
procesos tecnológicos específicos como memorias robustas, codificaciones
para detección y corrección de errores (EDAC), redundancias software,
redundancia modular triple (TMR) o soluciones a nivel sistema.
Esta tesis está enfocada a minimizar e incluso suprimir los efectos de los
SEUs y MCUs que particularmente ocurren en la electrónica de avión como
consecuencia de la exposición a radiación de partículas no cargadas (como
son los neutrones) que se encuentra potenciada a las típicas alturas de
vuelo. La criticidad en vuelo que tienen determinados sistemas obligan a que
dichos sistemas sean tolerantes a fallos, es decir, que garanticen un
correcto funcionamiento aún cuando se produzca un fallo en ellos. Es por
ello que soluciones como las presentadas en esta tesis tienen interés en el
sector industrial.
La Tesis incluye una descripción inicial de la física de la radiación
incidente sobre aeronaves, y el análisis de sus efectos en los componentes
electrónicos aeronaúticos basados en semiconductor, que desembocan en
la generación de SEUs y MCUs. Este análisis permite dimensionar
adecuadamente y optimizar los procedimientos de corrección que se
propongan posteriormente.
La Tesis propone un sistema de corrección de fallos SEUs y MCUs que
permita cumplir la condición de Sistema Tolerante a Fallos, a la vez que
minimiza los niveles de redundancia y de complejidad de los códigos de
corrección. El nivel de redundancia es minimizado con la introducción del
concepto propuesto HSB (Hardwired Seed Bits), en la que se reduce la
información esencial a unos pocos bits semilla, neutros frente a radiación.
Los códigos de corrección requeridos se reducen a la corrección de un único
error, gracias al uso del concepto de Distancia Virtual entre Bits, a partir del
cual será posible corregir múltiples errores simultáneos (MCUs) a partir de
códigos simples de corrección.
Un ejemplo de aplicación de la Tesis es la implementación de una
Protección Tolerante a Fallos sobre la memoria SRAM de una FPGA. Esto
significa que queda protegida no sólo la información contenida en la
memoria sino que también queda auto-protegida la función de protección
misma almacenada en la propia SRAM. De esta forma, el sistema es capaz
de auto-regenerarse ante un SEU o incluso un MCU, independientemente
de la zona de la SRAM sobre la que impacte la radiación. Adicionalmente,
esto se consigue con códigos simples tales como corrección por bit de
paridad y Hamming, minimizando la dedicación de recursos de computación
hacia tareas de supervisión del sistema.For airborne safety critical applications certification, different techniques
are implemented to prevent failures in electronic equipments. The HW
failures at flying heights of aircrafts related to solar radiation such as SEU
(Single-Event-Upset) and MCU (Multiple Bit Upset), causes bits alterations
that corrupt the information at memories. These HW failures cause errors, for
example, in the Configuration-Code of an FPGA that defines the
functionalities. The protection techniques require classically redundant
functionalities that increases the cost, components, memory space and
weight.
During the development phase for airborne safety critical applications,
different aerospace standards are generally recommended as ABD100,
RTCA-DO160, IEC62395, etc, and different techniques are classically used
to avoid failures such as SEU or MCU. These techniques are based on
specific technology processes, Hardened memories, error detection and
correction codes (EDAC), SW redundancy, Triple Modular Redundancy
(TMR) or System level solutions.
This Thesis is focussed to minimize, and even to remove, the effects of
SEUs and MCUs, that particularly occurs in the airborne electronics as a
consequence of its exposition to solar radiation of non-charged particles (for
example the neutrons). These non-charged particles are even powered at
flying altitudes due to aircraft volume. The safety categorization of different
equipments/functionalities requires a design based on fault-tolerant approach
that means, the system will continue its normal operation even if a failure
occurs. The solution proposed in this Thesis is relevant for the industrial
sector because of its Fault-tolerant capability.
Thesis includes an initial description for the physics of the solar radiation
that affects into aircrafts, and also the analyses of their effects into the
airborne electronics based on semiconductor components that create the
SEUs and MCUs. This detailed analysis allows the correct sizing and also
the optimization of the procedures used to correct the errors.
This Thesis proposes a system that corrects the SEUs and MCUs
allowing the fulfilment of the Fault-Tolerant requirement, reducing the
redundancy resources and also the complexity of the correction codes. The
redundancy resources are minimized thanks to the introduction of the
concept of HSB (Hardwired Seed Bits), in which the essential information is
reduced to a few seed bits, neutral to radiation. The correction codes
required are reduced to the correction of one error thanks to the use of the
concept of interleaving distance between adjacent bits, this allows the
simultaneous multiple error correction with simple single error correcting
codes.
An example of the application of this Thesis is the implementation of the
Fault-tolerant architecture of an SRAM-based FPGA. That means that the
information saved in the memory is protected but also the correction
functionality is auto protected as well, also saved into SRAM memory. In this
way, the system is able to self-regenerate the information lost in case of
SEUs or MCUs. This is independent of the SRAM area affected by the
radiation. Furthermore, this performance is achieved by means simple error
correcting codes, as parity bits or Hamming, that minimize the use of
computational resources to this supervision tasks for system.Programa Oficial de Doctorado en Ingeniería Eléctrica, Electrónica y AutomáticaPresidente: Luis Alfonso Entrena Arrontes.- Secretario: Pedro Reviriego Vasallo.- Vocal: Mª Luisa López Vallej
Avoiding core's DUE & SDC via acoustic wave detectors and tailored error containment and recovery
The trend of downsizing transistors and operating voltage scaling has made the processor chip more sensitive against radiation phenomena making soft errors an important challenge. New reliability techniques for handling soft errors in the logic and memories that allow meeting the desired failures-in-time (FIT) target are key to keep harnessing the benefits of Moore's law. The failure to scale the soft error rate caused by particle strikes, may soon limit the total number of cores that one may have running at the same time. This paper proposes a light-weight and scalable architecture to eliminate silent data corruption errors (SDC) and detected unrecoverable errors (DUE) of a core. The architecture uses acoustic wave detectors for error detection. We propose to recover by confining the errors in the cache hierarchy, allowing us to deal with the relatively long detection latencies. Our results show that the proposed mechanism protects the whole core (logic, latches and memory arrays) incurring performance overhead as low as 0.60%. © 2014 IEEE.Peer ReviewedPostprint (author's final draft
Using Fine Grain Approaches for highly reliable Design of FPGA-based Systems in Space
Nowadays using SRAM based FPGAs in space missions is increasingly considered due to their flexibility and reprogrammability. A challenge is the devices sensitivity to radiation effects that increased with modern architectures due to smaller CMOS structures. This work proposes fault tolerance methodologies, that are based on a fine grain view to modern reconfigurable architectures. The focus is on SEU mitigation challenges in SRAM based FPGAs which can result in crucial situations
MCU Tolerance in SRAMs through Low Redundancy Triple Adjacent Error Correction
(c) 2015 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other users, including reprinting/ republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works.[EN] Static random access memories (SRAMs) are key in electronic systems. They are used not only as standalone devices, but also embedded in application specific integrated circuits. One key challenge for memories is their susceptibility to radiation-induced soft errors that change the value of memory cells. Error correction codes (ECCs) are commonly used to ensure correct data despite soft errors effects in semiconductor memories. Single error correction/double error detection (SEC-DED) codes have been traditionally the preferred choice for data protection in SRAMs. During the last decade, the percentage of errors that affect more than one memory cell has increased substantially, mainly due to multiple cell upsets (MCUs) caused by radiation. The bits affected by these errors are physically close. To mitigate their effects, ECCs that correct single errors and double adjacent errors have been proposed. These codes, known as single error correction/double adjacent error correction (SEC-DAEC), require the same number of parity bits as traditional SEC-DED codes and a moderate increase in the decoder complexity. However, MCUs are not limited to double adjacent errors, because they affect more bits as technology scales. In this brief, new codes that can correct triple adjacent errors and 3-bit burst errors are presented. They have been implemented using a 45-nm library and compared with previous proposals, showing that our codes have better error protection with a moderate overhead and low redundancy.This work was supported in part by the Universitat Politecnica de Valencia, Valencia, Spain, through the DesTT Research Project under Grant SP20120806; in part by the Spanish Ministry of Science and Education under Project AYA-2009-13300-C03; in part by the Arenes Research Project under Grant TIN2012-38308-C02-01; and in part by the Research Project entitled Manufacturable and Dependable Multicore Architectures at Nanoscale within the framework of COST ICT Action under Grant 1103.Saiz-Adalid, L.; Reviriego, P.; Gil, P.; Pontarelli, S.; Maestro, JA. (2015). MCU Tolerance in SRAMs through Low Redundancy Triple Adjacent Error Correction. IEEE Transactions on Very Large Scale Integration (VLSI) Systems. 23(10):2332-2336. https://doi.org/10.1109/TVLSI.2014.2357476S23322336231
New Design Techniques for Dynamic Reconfigurable Architectures
L'abstract è presente nell'allegato / the abstract is in the attachmen
Radiation Mitigation and Power Optimization Design Tools for Reconfigurable Hardware in Orbit
The Reconfigurable Hardware in Orbit (RHinO)project is focused on creating a set of design tools that facilitate and automate design techniques for reconfigurable computing in space, using SRAM-based field-programmable-gate-array (FPGA) technology. In the second year of the project, design tools that leverage an established FPGA design environment have been created to visualize and analyze an FPGA circuit for radiation weaknesses and power inefficiencies. For radiation, a single event Upset (SEU) emulator, persistence analysis tool, and a half-latch removal tool for Xilinx/Virtex-II devices have been created. Research is underway on a persistence mitigation tool and multiple bit upsets (MBU) studies. For power, synthesis level dynamic power visualization and analysis tools have been completed. Power optimization tools are under development and preliminary test results are positive
DeSyRe: on-Demand System Reliability
The DeSyRe project builds on-demand adaptive and reliable Systems-on-Chips (SoCs). As fabrication technology scales down, chips are becoming less reliable, thereby incurring increased power and performance costs for fault tolerance. To make matters worse, power density is becoming a significant limiting factor in SoC design, in general. In the face of such changes in the technological landscape, current solutions for fault tolerance are expected to introduce excessive overheads in future systems. Moreover, attempting to design and manufacture a totally defect and fault-free system, would impact heavily, even prohibitively, the design, manufacturing, and testing costs, as well as the system performance and power consumption. In this context, DeSyRe delivers a new generation of systems that are reliable by design at well-balanced power, performance, and design costs. In our attempt to reduce the overheads of fault-tolerance, only a small fraction of the chip is built to be fault-free. This fault-free part is then employed to manage the remaining fault-prone resources of the SoC. The DeSyRe framework is applied to two medical systems with high safety requirements (measured using the IEC 61508 functional safety standard) and tight power and performance constraints
- …