861 research outputs found
Novel fault tolerant Multi-Bit Upset (MBU) Error-Detection and Correction (EDAC) architecture
Desde el punto de vista de seguridad, la certificación aeronáutica de
aplicaciones críticas de vuelo requiere diferentes técnicas que son usadas
para prevenir fallos en los equipos electrónicos. Los fallos de tipo hardware
debido a la radiación solar que existe a las alturas standard de vuelo, como
SEU (Single Event Upset) y MCU (Multiple Bit Upset), provocan un cambio
de estado de los bits que soportan la información almacenada en memoria.
Estos fallos se producen, por ejemplo, en la memoria de configuración de
una FPGA, que es donde se definen todas las funcionalidades. Las técnicas
de protección requieren normalmente de redundancias que incrementan el
coste, número de componentes, tamaño de la memoria y peso.
En la fase de desarrollo de aplicaciones críticas de vuelo, generalmente
se utilizan una serie de estándares o recomendaciones de diseño como
ABD100, RTCA DO-160, IEC62395, etc, y diferentes técnicas de protección
para evitar fallos del tipo SEU o MCU. Estas técnicas están basadas en
procesos tecnológicos específicos como memorias robustas, codificaciones
para detección y corrección de errores (EDAC), redundancias software,
redundancia modular triple (TMR) o soluciones a nivel sistema.
Esta tesis está enfocada a minimizar e incluso suprimir los efectos de los
SEUs y MCUs que particularmente ocurren en la electrónica de avión como
consecuencia de la exposición a radiación de partículas no cargadas (como
son los neutrones) que se encuentra potenciada a las típicas alturas de
vuelo. La criticidad en vuelo que tienen determinados sistemas obligan a que
dichos sistemas sean tolerantes a fallos, es decir, que garanticen un
correcto funcionamiento aún cuando se produzca un fallo en ellos. Es por
ello que soluciones como las presentadas en esta tesis tienen interés en el
sector industrial.
La Tesis incluye una descripción inicial de la física de la radiación
incidente sobre aeronaves, y el análisis de sus efectos en los componentes
electrónicos aeronaúticos basados en semiconductor, que desembocan en
la generación de SEUs y MCUs. Este análisis permite dimensionar
adecuadamente y optimizar los procedimientos de corrección que se
propongan posteriormente.
La Tesis propone un sistema de corrección de fallos SEUs y MCUs que
permita cumplir la condición de Sistema Tolerante a Fallos, a la vez que
minimiza los niveles de redundancia y de complejidad de los códigos de
corrección. El nivel de redundancia es minimizado con la introducción del
concepto propuesto HSB (Hardwired Seed Bits), en la que se reduce la
información esencial a unos pocos bits semilla, neutros frente a radiación.
Los códigos de corrección requeridos se reducen a la corrección de un único
error, gracias al uso del concepto de Distancia Virtual entre Bits, a partir del
cual será posible corregir múltiples errores simultáneos (MCUs) a partir de
códigos simples de corrección.
Un ejemplo de aplicación de la Tesis es la implementación de una
Protección Tolerante a Fallos sobre la memoria SRAM de una FPGA. Esto
significa que queda protegida no sólo la información contenida en la
memoria sino que también queda auto-protegida la función de protección
misma almacenada en la propia SRAM. De esta forma, el sistema es capaz
de auto-regenerarse ante un SEU o incluso un MCU, independientemente
de la zona de la SRAM sobre la que impacte la radiación. Adicionalmente,
esto se consigue con códigos simples tales como corrección por bit de
paridad y Hamming, minimizando la dedicación de recursos de computación
hacia tareas de supervisión del sistema.For airborne safety critical applications certification, different techniques
are implemented to prevent failures in electronic equipments. The HW
failures at flying heights of aircrafts related to solar radiation such as SEU
(Single-Event-Upset) and MCU (Multiple Bit Upset), causes bits alterations
that corrupt the information at memories. These HW failures cause errors, for
example, in the Configuration-Code of an FPGA that defines the
functionalities. The protection techniques require classically redundant
functionalities that increases the cost, components, memory space and
weight.
During the development phase for airborne safety critical applications,
different aerospace standards are generally recommended as ABD100,
RTCA-DO160, IEC62395, etc, and different techniques are classically used
to avoid failures such as SEU or MCU. These techniques are based on
specific technology processes, Hardened memories, error detection and
correction codes (EDAC), SW redundancy, Triple Modular Redundancy
(TMR) or System level solutions.
This Thesis is focussed to minimize, and even to remove, the effects of
SEUs and MCUs, that particularly occurs in the airborne electronics as a
consequence of its exposition to solar radiation of non-charged particles (for
example the neutrons). These non-charged particles are even powered at
flying altitudes due to aircraft volume. The safety categorization of different
equipments/functionalities requires a design based on fault-tolerant approach
that means, the system will continue its normal operation even if a failure
occurs. The solution proposed in this Thesis is relevant for the industrial
sector because of its Fault-tolerant capability.
Thesis includes an initial description for the physics of the solar radiation
that affects into aircrafts, and also the analyses of their effects into the
airborne electronics based on semiconductor components that create the
SEUs and MCUs. This detailed analysis allows the correct sizing and also
the optimization of the procedures used to correct the errors.
This Thesis proposes a system that corrects the SEUs and MCUs
allowing the fulfilment of the Fault-Tolerant requirement, reducing the
redundancy resources and also the complexity of the correction codes. The
redundancy resources are minimized thanks to the introduction of the
concept of HSB (Hardwired Seed Bits), in which the essential information is
reduced to a few seed bits, neutral to radiation. The correction codes
required are reduced to the correction of one error thanks to the use of the
concept of interleaving distance between adjacent bits, this allows the
simultaneous multiple error correction with simple single error correcting
codes.
An example of the application of this Thesis is the implementation of the
Fault-tolerant architecture of an SRAM-based FPGA. That means that the
information saved in the memory is protected but also the correction
functionality is auto protected as well, also saved into SRAM memory. In this
way, the system is able to self-regenerate the information lost in case of
SEUs or MCUs. This is independent of the SRAM area affected by the
radiation. Furthermore, this performance is achieved by means simple error
correcting codes, as parity bits or Hamming, that minimize the use of
computational resources to this supervision tasks for system.Programa Oficial de Doctorado en Ingeniería Eléctrica, Electrónica y AutomáticaPresidente: Luis Alfonso Entrena Arrontes.- Secretario: Pedro Reviriego Vasallo.- Vocal: Mª Luisa López Vallej
Hardware Mechanisms for Efficient Memory System Security
The security of a computer system hinges on the trustworthiness of the operating system and the hardware, as applications rely on them to protect code and data. As a result, multiple protections for safeguarding the hardware and OS from attacks are being continuously proposed and deployed. These defenses, however, are far from ideal as they only provide partial protection, require complex hardware and software stacks, or incur high overheads. This dissertation presents hardware mechanisms for efficiently providing strong protections against an array of attacks on the memory hardware and the operating system’s code and data.
In the first part of this dissertation, we analyze and optimize protections targeted at defending memory hardware from physical attacks. We begin by showing that, contrary to popular belief, current DDR3 and DDR4 memory systems that employ memory scrambling are still susceptible to cold boot attacks (where the DRAM is frozen to give it sufficient retention time and is then re-read by an attacker after reboot to extract sensitive data). We then describe how memory scramblers in modern memory controllers can be transparently replaced by strong stream ciphers without impacting performance.
We also demonstrate how the large storage overheads associated with authenticated memory encryption schemes (which enable tamper-proof storage in off-chip memories) can be reduced by leveraging compact integer encodings and error-correcting code (ECC) DRAMs – without forgoing the error detection and correction capabilities of ECC DRAMs.
The second part of this dissertation presents Neverland: a low-overhead, hardware-assisted, memory protection scheme that safeguards the operating system from rootkits and kernel-mode malware. Once the system is done booting, Neverland’s hardware takes away the operating system’s ability to overwrite certain configuration registers, as well as portions of its own physical address space that contain kernel code and security-critical data. Furthermore, it prohibits the CPU from fetching privileged code from any memory region lying outside the physical addresses assigned to the OS kernel and drivers. This combination of protections makes it extremely hard for an attacker to tamper with the kernel or introduce new privileged code into the system – even in the presence of software vulnerabilities. Neverland enables operating systems to reduce their attack surface without having to rely on complex integrity monitoring software or hardware.
The hardware mechanisms we present in this dissertation provide building blocks for constructing a secure computing base while incurring lower overheads than existing protections.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttps://deepblue.lib.umich.edu/bitstream/2027.42/147604/1/salessaf_1.pd
Survey of Soft Error Mitigation Techniques Applied to LEON3 Soft Processors on SRAM-Based FPGAs
Soft-core processors implemented in SRAM-based FPGAs are an attractive option for applications to be employed in radiation environments due to their flexibility, relatively-low application development costs, and reconfigurability features enabling them to adapt to the evolving mission needs. Despite the advantages soft-core processors possess, they are seldom used in critical applications because they are more sensitive to radiation than their hard-core counterparts. For instance, both the logic and signal routing circuitry of a soft-core processor as well as its user memory are susceptible to radiation-induced faults. Therefore, soft-core processors must be appropriately hardened against ionizing-radiation to become a feasible design choice for harsh environments and thus to reap all their benefits. This survey henceforth discusses various techniques to protect the configuration and user memories of an LEON3 soft processor, which is one of the most widely used soft-core processors in radiation environments, as reported in the state-of-the-art literature, with the objective of facilitating the choice of right fault-mitigation solution for any given soft-core processor
Fault-tolerant computer study
A set of building block circuits is described which can be used with commercially available microprocessors and memories to implement fault tolerant distributed computer systems. Each building block circuit is intended for VLSI implementation as a single chip. Several building blocks and associated processor and memory chips form a self checking computer module with self contained input output and interfaces to redundant communications buses. Fault tolerance is achieved by connecting self checking computer modules into a redundant network in which backup buses and computer modules are provided to circumvent failures. The requirements and design methodology which led to the definition of the building block circuits are discussed
Enabling Recovery of Secure Non-Volatile Memories
Emerging non-volatile memories (NVMs), such as phase change memory (PCM), spin-transfer torque RAM (STT-RAM) and resistive RAM (ReRAM), have dual memory-storage characteristics and, therefore, are strong candidates to replace or augment current DRAM and secondary storage devices. The newly released Intel 3D XPoint persistent memory and Optane SSD series have shown promising features. However, when these new devices are exposed to events such as power loss, many issues arise when data recovery is expected. In this dissertation, I devised multiple schemes to enable secure data recovery for emerging NVM technologies when memory encryption is used. With the data-remanence feature of NVMs, physical attacks become easier; hence, emerging NVMs are typically paired with encryption. In particular, counter-mode encryption is commonly used due to its performance and security advantages over other schemes (e.g., electronic codebook encryption). However, enabling data recovery in power failure events requires the recovery of security metadata associated with data blocks. Naively writing security metadata updates along with data for each operation can further exacerbate the write endurance problem of NVMs as they have limited write endurance and very slow write operations. Therefore, it is necessary to enable the recovery of data and security metadata (encryption counters) but without incurring a significant number of writes. The first work of this dissertation presents an explanation of Osiris, a novel mechanism that repurposes error correcting code (ECC) co-located with data to enable recovery of encryption counters by additionally serving as a sanity-check for encryption counters used. Thus, by using a stop-loss mechanism with a limited number of trials, ECC can be used to identify which encryption counter that was used most recently to encrypt the data and, hence, allow correct decryption and recovery. The first work of this dissertation explores how different stop-loss parameters along with optimizations of Osiris can potentially reduce the number of writes. Overall, Osiris enables the recovery of encryption counters while achieving better performance and fewer writes than a conventional write-back caching scheme of encryption counters, which lacks the ability to recover encryption counters. Later, in the second work, Osiris implementation is expanded to work with different counter-mode memory encryption schemes, where we use an epoch-based approach to periodically persist updated counters. Later, when a crash occurs, we can recover counters through test-and-verification to identify the correct counter within the size of an epoch for counter recovery. Our proposed scheme, Osiris-Global, incurs minimal performance overheads and write overheads in enabling the recovery of encryption counters. In summary, the findings of the present PhD work enable the recovery of secure NVM systems and, hence, allows persistent applications to leverage the persistency features of NVMs. Meanwhile, it also minimizes the number of writes required in meeting this crash consistency requirement of secure NVM systems
Affordable techniques for dependable microprocessor design
As high computing power is available at an affordable cost, we rely on microprocessor-based systems for much greater variety of applications. This dependence indicates that a processor failure could have more diverse impacts on our daily lives. Therefore, dependability is becoming an increasingly important quality measure of microprocessors.;Temporary hardware malfunctions caused by unstable environmental conditions can lead the processor to an incorrect state. This is referred to as a transient error or soft error. Studies have shown that soft errors are the major source of system failures. This dissertation characterizes the soft error behavior on microprocessors and presents new microarchitectural approaches that can realize high dependability with low overhead.;Our fault injection studies using RISC processors have demonstrated that different functional blocks of the processor have distinct susceptibilities to soft errors. The error susceptibility information must be reflected in devising fault tolerance schemes for cost-sensitive applications. Considering the common use of on-chip caches in modern processors, we investigated area-efficient protection schemes for memory arrays. The idea of caching redundant information was exploited to optimize resource utilization for increased dependability. We also developed a mechanism to verify the integrity of data transfer from lower level memories to the primary caches. The results of this study show that by exploiting bus idle cycles and the information redundancy, an almost complete check for the initial memory data transfer is possible without incurring a performance penalty.;For protecting the processor\u27s control logic, which usually remains unprotected, we propose a low-cost reliability enhancement strategy. We classified control logic signals into static and dynamic control depending on their changeability, and applied various techniques including commit-time checking, signature caching, component-level duplication, and control flow monitoring. Our schemes can achieve more than 99% coverage with a very small hardware addition. Finally, a virtual duplex architecture for superscalar processors is presented. In this system-level approach, the processor pipeline is backed up by a partially replicated pipeline. The replication-based checker minimizes the design and verification overheads. For a large-scale superscalar processor, the proposed architecture can bring 61.4% reduction in die area while sustaining the maximum performance
Error Characterization and Correction Techniques for Reliable STT-RAM Designs
The concerns on the continuous scaling of mainstream memory technologies have motivated tremendous investment to emerging memories. Being a promising candidate, spin-transfer torque random access memory (STT-RAM) offers nanosecond access time comparable to SRAM, high integration density close to DRAM, non-volatility as Flash memory, and good scalability. It is well positioned as the replacement of SRAM and DRAM for on-chip cache and main memory applications. However, reliability issue continues being one of the major challenges in STT-RAM memory designs due to the process variations and unique thermal fluctuations, i.e., the stochastic resistance switching property of magnetic devices.
In this dissertation, I decoupled the reliability issues as following three-folds: First, the characterization of STT-RAM operation errors often require expensive Monte-Carlo runs with hybrid magnetic-CMOS simulation steps, making it impracticable for architects and system designs; Second, the state of the art does not have sufficiently understanding on the unique reliability issue of STT-RAM, and conventional error correction codes (ECCs) cannot efficiently handle such errors; Third, while the information density of STT-RAM can be boosted by multi-level cell (MLC) design, the more prominent reliability concerns and the complicated access mechanism greatly limit its applications in memory subsystems.
Thus, I present a novel through solution set to both characterize and tackle the above reliability challenges in STT-RAM designs. In the first part of the dissertation, I introduce a new characterization method that can accurately and efficiently capture the multi-variable design metrics of STT-RAM cells; Second, a novel ECC scheme, namely, content-dependent ECC (CD-ECC), is developed to combat the characterized asymmetric errors of STT-RAM at 0->1 and 1->0 bit flipping's; Third, I present a circuit-architecture design, namely state-restricted multi-level cell (SR-MLC) STT-RAM design, which simultaneously achieves high information density, good storage reliability and fast write speed, making MLC STT-RAM accessible for system designers under current technology node. Finally, I conclude that efficient robust (or ECC) designs for STT-RAM require a deep holistic understanding on three different levels-device, circuit and architecture. Innovative ECC schemes and their architectural applications, still deserve serious research and investigation in the near future
Hardware / Software Architectural and Technological Exploration for Energy-Efficient and Reliable Biomedical Devices
Nowadays, the ubiquity of smart appliances in our everyday lives is increasingly strengthening the links between humans and machines. Beyond making our lives easier and more convenient, smart devices are now playing an important role in personalized healthcare delivery. This technological breakthrough is particularly relevant in a world where population aging and unhealthy habits have made non-communicable diseases the first leading cause of death worldwide according to international public health organizations. In this context, smart health monitoring systems termed Wireless Body Sensor Nodes (WBSNs), represent a paradigm shift in the healthcare landscape by greatly lowering the cost of long-term monitoring of chronic diseases, as well as improving patients' lifestyles. WBSNs are able to autonomously acquire biological signals and embed on-node Digital Signal Processing (DSP) capabilities to deliver clinically-accurate health diagnoses in real-time, even outside of a hospital environment. Energy efficiency and reliability are fundamental requirements for WBSNs, since they must operate for extended periods of time, while relying on compact batteries. These constraints, in turn, impose carefully designed hardware and software architectures for hosting the execution of complex biomedical applications. In this thesis, I develop and explore novel solutions at the architectural and technological level of the integrated circuit design domain, to enhance the energy efficiency and reliability of current WBSNs. Firstly, following a top-down approach driven by the characteristics of biomedical algorithms, I perform an architectural exploration of a heterogeneous and reconfigurable computing platform devoted to bio-signal analysis. By interfacing a shared Coarse-Grained Reconfigurable Array (CGRA) accelerator, this domain-specific platform can achieve higher performance and energy savings, beyond the capabilities offered by a baseline multi-processor system. More precisely, I propose three CGRA architectures, each contributing differently to the maximization of the application parallelization. The proposed Single, Multi and Interleaved-Datapath CGRA designs allow the developed platform to achieve substantial energy savings of up to 37%, when executing complex biomedical applications, with respect to a multi-core-only platform. Secondly, I investigate how the modeling of technology reliability issues in logic and memory components can be exploited to adequately adjust the frequency and supply voltage of a circuit, with the aim of optimizing its computing performance and energy efficiency. To this end, I propose a novel framework for workload-dependent Bias Temperature Instability (BTI) impact analysis on biomedical application results quality. Remarkably, the framework is able to determine the range of safe circuit operating frequencies without introducing worst-case guard bands. Experiments highlight the possibility to safely raise the frequency up to 101% above the maximum obtained with the classical static timing analysis. Finally, through the study of several well-known biomedical algorithms, I propose an approach allowing energy savings by dynamically and unequally protecting an under-powered data memory in a new way compared to regular error protection schemes. This solution relies on the Dynamic eRror compEnsation And Masking (DREAM) technique that reduces by approximately 21% the energy consumed by traditional error correction codes
Error control coding for semiconductor memories
All modern computers have memories built from VLSI RAM chips.
Individually, these devices are highly reliable and any single chip
may perform for decades before failing. However, when many of the
chips are combined in a single memory, the time that at least one
of them fails could decrease to mere few hours. The presence of
the failed chips causes errors when binary data are stored in and
read out from the memory. As a consequence the reliability of the
computer memories degrade. These errors are classified into hard
errors and soft errors. These can also be termed as permanent and
temporary errors respectively.
In some situations errors may show up as random errors, in
which both 1-to-O errors and 0-to-l errors occur randomly in a
memory word. In other situations the most likely errors are
unidirectional errors in which 1-to-O errors or 0-to-l errors may
occur but not both of them in one particular memory word.
To achieve a high speed and highly reliable computer, we need
large capacity memory. Unfortunately, with high density of
semiconductor cells in memory, the error rate increases
dramatically. Especially, the VLSI RAMs suffer from soft errors
caused by alpha-particle radiation. Thus the reliability of
computer could become unacceptable without error reducing schemes.
In practice several schemes to reduce the effects of the memory
errors were commonly used. But most of them are valid only for hard errors. As an efficient and economical method, error control
coding can be used to overcome both hard and soft errors.
Therefore it is becoming a widely used scheme in computer industry
today.
In this thesis, we discuss error control coding for
semiconductor memories. The thesis consists of six chapters.
Chapter one is an introduction to error detecting and correcting
coding for computer memories. Firstly, semiconductor memories and
their problems are discussed. Then some schemes for error reduction
in computer memories are given and the advantages of using error
control coding over other schemes are presented.
In chapter two, after a brief review of memory organizations,
memory cells and their physical constructions and principle of
storing data are described. Then we analyze mechanisms of various
errors occurring in semiconductor memories so that, for different
errors different coding schemes could be selected.
Chapter three is devoted to the fundamental coding theory. In
this chapter background on encoding and decoding algorithms are
presented.
In chapter four, random error control codes are discussed.
Among them error detecting codes, single* error correcting/double
error detecting codes and multiple error correcting codes are
analyzed. By using examples, the decoding implementations for
parity codes, Hamming codes, modified Hamming codes and majority
logic codes are demonstrated. Also in this chapter it was shown
that by combining error control coding and other schemes, the reliability of the memory can be improved by many orders.
For unidirectional errors, we introduced unordered codes in
chapter five. Two types of the unordered codes are discussed. They
are systematic and nonsystematic unordered codes. Both of them are
very powerful for unidirectional error detection. As an example of
optimal nonsystematic unordered code, an efficient balanced code
are analyzed. Then as an example of systematic unordered codes
Berger codes are analyzed. Considering the fact that in practice
random errors still may occur in unidirectional error memories,
some recently developed t-random error correcting/all
unidirectional error detecting codes are introduced. Illustrative
examples are also included to facilitate the explanation.
Chapter six is the conclusions of the thesis.
The whole thesis is oriented to the applications of error
control coding for semiconductor memories. Most of the codes
discussed in the thesis are widely used in practice. Through the
thesis we attempt to provide a review of coding in computer
memories and emphasize the advantage of coding. It is obvious that
with the requirement of higher speed and higher capacity
semiconductor memories, error control coding will play even more
important role in the future
- …