580 research outputs found
PTM-based hybrid error-detection architecture for ARM microprocessors
This work presents a hybrid error detection architecture that uses ARM PTM trace interface to observe ARM microprocessor behaviour. The proposed approach is suitable for COTS microprocessors because it does not modify the microprocessor architecture and is able to detect errors thanks to the reuse of its trace subsystem. Validation has been performed by proton irradiation and fault injection campaigns on a Zynq AP SoC including a Cortex-A9 ARM microprocessor and an implementation of the proposed hardware monitor in programmable logic. Experimental results demonstrate that a high error detection rate can be achieved on a commercial microprocessor
System-on-Chip Design and Test with Embedded Debug Capabilities
In this project, I started with a System-on-Chip platform with embedded test structures. The baseline platform consisted of a Leon2 CPU, AMBA on-chip bus, and an Advanced Encryption Standard decryption module. The basic objective of this thesis was to use the embedded reconfigurable logic blocks for post-silicon debug and verification.
The System-on-Chip platform was designed at the register transistor level and implemented in a 180-nm IBM process. Test logic instrumentation was done with DAFCA (Design Automation for Flexible Chip Architecture) Inc. pre-silicon tools. The design was then synthesized using the Synopsys Design Compiler and placed and routed using Cadence SOC Encounter. Total transistor count is about 3 million, including 1400K transistors for the debug module serving as on chip logic analyzer. Core size of the design is about 4.8mm x 4.8mm and the system is working at 151MHz. Design verification was done with Cadence NCSim.
The controllability and observability of internal signals of the design is greatly increased with the help of pre-silicon tools which helps locate bugs and later fix them with the help of post-silicon tools. This helps prevent re-spins on several occasions thus saving millions of dollars. Post-silicon tools have been used to program assertions and triggers and inject numerous personalities into the reconfigurable fabric which has greatly increased the versatility of the circuit
Error Detection and Diagnosis for System-on-Chip in Space Applications
Tesis por compendio de publicacionesLos componentes electrónicos comerciales, comúnmente llamados componentes
Commercial-Off-The-Shelf (COTS) están presentes en multitud de dispositivos habituales
en nuestro día a día. Particularmente, el uso de microprocesadores y sistemas en chip (SoC)
altamente integrados ha favorecido la aparición de dispositivos electrónicos cada vez más
inteligentes que sostienen el estilo de vida y el avance de la sociedad moderna. Su uso se
ha generalizado incluso en aquellos sistemas que se consideran críticos para la seguridad,
como vehículos, aviones, armamento, dispositivos médicos, implantes o centrales eléctricas.
En cualquiera de ellos, un fallo podría tener graves consecuencias humanas o económicas.
Sin embargo, todos los sistemas electrónicos conviven constantemente con factores internos
y externos que pueden provocar fallos en su funcionamiento. La capacidad de un sistema
para funcionar correctamente en presencia de fallos se denomina tolerancia a fallos, y es
un requisito en el diseño y operación de sistemas críticos.
Los vehículos espaciales como satélites o naves espaciales también hacen uso de
microprocesadores para operar de forma autónoma o semi autónoma durante su vida útil,
con la dificultad añadida de que no pueden ser reparados en órbita, por lo que se consideran
sistemas críticos. Además, las duras condiciones existentes en el espacio, y en particular
los efectos de la radiación, suponen un gran desafío para el correcto funcionamiento de los
dispositivos electrónicos. Concretamente, los fallos transitorios provocados por radiación
(conocidos como soft errors) tienen el potencial de ser una de las mayores amenazas para
la fiabilidad de un sistema en el espacio.
Las misiones espaciales de gran envergadura, típicamente financiadas públicamente
como en el caso de la NASA o la Agencia Espacial Europea (ESA), han tenido
históricamente como requisito evitar el riesgo a toda costa por encima de cualquier
restricción de coste o plazo. Por ello, la selección de componentes resistentes a la radiación
(rad-hard) específicamente diseñados para su uso en el espacio ha sido la metodología
imperante en el paradigma que hoy podemos denominar industria espacial tradicional, u
Old Space. Sin embargo, los componentes rad-hard tienen habitualmente un coste mucho
más alto y unas prestaciones mucho menores que otros componentes COTS equivalentes.
De hecho, los componentes COTS ya han sido utilizados satisfactoriamente en misiones
de la NASA o la ESA cuando las prestaciones requeridas por la misión no podían ser
cubiertas por ningún componente rad-hard existente.
En los últimos años, el acceso al espacio se está facilitando debido en gran parte a la
entrada de empresas privadas en la industria espacial. Estas empresas no siempre buscan
evitar el riesgo a toda costa, sino que deben perseguir una rentabilidad económica, por
lo que hacen un balance entre riesgo, coste y plazo mediante gestión del riesgo en un
paradigma denominado Nuevo Espacio o New Space. Estas empresas a menudo están
interesadas en entregar servicios basados en el espacio con las máximas prestaciones y el mayor beneficio posibles, para lo cual los componentes rad-hard son menos atractivos
debido a su mayor coste y menores prestaciones que los componentes COTS existentes.
Sin embargo, los componentes COTS no han sido específicamente diseñados para su uso
en el espacio y típicamente no incluyen técnicas específicas para evitar que los efectos de
la radiación afecten su funcionamiento. Los componentes COTS se comercializan tal cual
son, y habitualmente no es posible modificarlos para mejorar su resistencia a la radiación.
Además, los elevados niveles de integración de los sistemas en chip (SoC) complejos
de altas prestaciones dificultan su observación y la aplicación de técnicas de tolerancia
a fallos. Este problema es especialmente relevante en el caso de los microprocesadores.
Por tanto, existe un gran interés en el desarrollo de técnicas que permitan conocer y
mejorar el comportamiento de los microprocesadores COTS bajo radiación sin modificar
su arquitectura y sin interferir en su funcionamiento para facilitar su uso en el espacio y
con ello maximizar las prestaciones de las misiones espaciales presentes y futuras.
En esta Tesis se han desarrollado técnicas novedosas para detectar, diagnosticar y
mitigar los errores producidos por radiación en microprocesadores y sistemas en chip
(SoC) comerciales, utilizando la interfaz de traza como punto de observación. La interfaz de
traza es un recurso habitual en los microprocesadores modernos, principalmente enfocado
a soportar las tareas de desarrollo y depuración del software durante la fase de diseño. Sin
embargo, una vez el desarrollo ha concluido, la interfaz de traza típicamente no se utiliza
durante la fase operativa del sistema, por lo que puede ser reutilizada sin coste. La interfaz
de traza constituye un punto de conexión viable para observar el comportamiento de un
microprocesador de forma no intrusiva y sin interferir en su funcionamiento.
Como resultado de esta Tesis se ha desarrollado un módulo IP capaz de recabar
y decodificar la información de traza de un microprocesador COTS moderno de altas
prestaciones. El IP es altamente configurable y personalizable para adaptarse a diferentes
aplicaciones y tipos de procesadores. Ha sido diseñado y validado utilizando el dispositivo
Zynq-7000 de Xilinx como plataforma de desarrollo, que constituye un dispositivo COTS
de interés en la industria espacial. Este dispositivo incluye un procesador ARM Cortex-A9
de doble núcleo, que es representativo del conjunto de microprocesadores hard-core
modernos de altas prestaciones. El IP resultante es compatible con la tecnología ARM
CoreSight, que proporciona acceso a información de traza en los microprocesadores ARM.
El IP incorpora técnicas para detectar errores en el flujo de ejecución y en los datos de la
aplicación ejecutada utilizando la información de traza, en tiempo real y con muy baja
latencia. El IP se ha validado en campañas de inyección de fallos y también en radiación con
protones y neutrones en instalaciones especializadas. También se ha combinado con otras
técnicas de tolerancia a fallos para construir técnicas híbridas de mitigación de errores.
Los resultados experimentales obtenidos demuestran su alta capacidad de detección y
potencialidad en el diagnóstico de errores producidos por radiación.
El resultado de esta Tesis, desarrollada en el marco de un Doctorado Industrial entre
la Universidad Carlos III de Madrid (UC3M) y la empresa Arquimea, se ha transferido satisfactoriamente al entorno empresarial en forma de un proyecto financiado por la
Agencia Espacial Europea para continuar su desarrollo y posterior explotación.Commercial electronic components, also known as Commercial-Off-The-Shelf (COTS),
are present in a wide variety of devices commonly used in our daily life. Particularly, the
use of microprocessors and highly integrated System-on-Chip (SoC) devices has fostered
the advent of increasingly intelligent electronic devices which sustain the lifestyles and the
progress of modern society. Microprocessors are present even in safety-critical systems,
such as vehicles, planes, weapons, medical devices, implants, or power plants. In any of
these cases, a fault could involve severe human or economic consequences. However, every
electronic system deals continuously with internal and external factors that could provoke
faults in its operation. The capacity of a system to operate correctly in presence of faults
is known as fault-tolerance, and it becomes a requirement in the design and operation of
critical systems.
Space vehicles such as satellites or spacecraft also incorporate microprocessors to
operate autonomously or semi-autonomously during their service life, with the additional
difficulty that they cannot be repaired once in-orbit, so they are considered critical systems.
In addition, the harsh conditions in space, and specifically radiation effects, involve a big
challenge for the correct operation of electronic devices. In particular, radiation-induced
soft errors have the potential to become one of the major risks for the reliability of systems
in space.
Large space missions, typically publicly funded as in the case of NASA or European
Space Agency (ESA), have followed historically the requirement to avoid the risk at any
expense, regardless of any cost or schedule restriction. Because of that, the selection of
radiation-resistant components (known as rad-hard) specifically designed to be used in
space has been the dominant methodology in the paradigm of traditional space industry,
also known as “Old Space”. However, rad-hard components have commonly a much higher
associated cost and much lower performance that other equivalent COTS devices. In fact,
COTS components have already been used successfully by NASA and ESA in missions
that requested such high performance that could not be satisfied by any available rad-hard
component.
In the recent years, the access to space is being facilitated in part due to the irruption
of private companies in the space industry. Such companies do not always seek to avoid
the risk at any cost, but they must pursue profitability, so they perform a trade-off between
risk, cost, and schedule through risk management in a paradigm known as “New Space”.
Private companies are often interested in deliver space-based services with the maximum
performance and maximum benefit as possible. With such objective, rad-hard components
are less attractive than COTS due to their higher cost and lower performance.
However, COTS components have not been specifically designed to be used in space
and typically they do not include specific techniques to avoid or mitigate the radiation effects in their operation. COTS components are commercialized “as is”, so it is not
possible to modify them to improve their susceptibility to radiation effects. Moreover,
the high levels of integration of complex, high-performance SoC devices hinder their
observability and the application of fault-tolerance techniques. This problem is especially
relevant in the case of microprocessors. Thus, there is a growing interest in the development
of techniques allowing to understand and improve the behavior of COTS microprocessors
under radiation without modifying their architecture and without interfering with their
operation. Such techniques may facilitate the use of COTS components in space and
maximize the performance of present and future space missions.
In this Thesis, novel techniques have been developed to detect, diagnose, and
mitigate radiation-induced errors in COTS microprocessors and SoCs using the trace
interface as an observation point. The trace interface is a resource commonly found
in modern microprocessors, mainly intended to support software development and
debugging activities during the design phase. However, it is commonly left unused
during the operational phase of the system, so it can be reused with no cost. The trace
interface constitutes a feasible connection point to observe microprocessor behavior in a
non-intrusive manner and without disturbing processor operation.
As a result of this Thesis, an IP module has been developed capable to gather and
decode the trace information of a modern, high-end, COTS microprocessor. The IP is highly
configurable and customizable to support different applications and processor types. The
IP has been designed and validated using the Xilinx Zynq-7000 device as a development
platform, which is an interesting COTS device for the space industry. This device features a
dual-core ARM Cortex-A9 processor, which is a good representative of modern, high-end,
hard-core microprocessors. The resulting IP is compatible with the ARM CoreSight
technology, which enables access to trace information in ARM microprocessors. The IP is
able to detect errors in the execution flow of the microprocessor and in the application data
using trace information, in real time and with very low latency. The IP has been validated
in fault injection campaigns and also under proton and neutron irradiation campaigns in
specialized facilities. It has also been combined with other fault-tolerance techniques
to build hybrid error mitigation approaches. Experimental results demonstrate its high
detection capabilities and high potential for the diagnosis of radiation-induced errors.
The result of this Thesis, developed in the framework of an Industrial Ph.D. between the
University Carlos III of Madrid (UC3M) and the company Arquimea, has been successfully
transferred to the company business as a project sponsored by European Space Agency to
continue its development and subsequent commercialization.Programa de Doctorado en Ingeniería Eléctrica, Electrónica y Automática por la Universidad Carlos III de MadridPresidenta: María Luisa López Vallejo.- Secretario: Enrique San Millán Heredia.- Vocal: Luigi Di Lill
Achieving Functional Correctness in Large Interconnect Systems.
In today's semi-conductor industry, large chip-multiprocessors and systems-on-chip are being developed, integrating a large number of components on a single chip. The sheer size of these designs and the intricacy of the communication patterns they exhibit have propelled the development of network-on-chip (NoC) interconnects as the basis for the communication infrastructure in these systems. Faced with the interconnect's growing size and complexity, several challenges hinder its effective validation. During the interconnect's development, the functional verification process relies heavily on the use of emulation and post-silicon validation platforms. However, detecting and debugging errors on these platforms is a difficult endeavour due to the limited observability, and in turn the low verification capabilities, they provide. Additionally, with the inherent incompleteness of design-time validation efforts, the potential of design bugs escaping into the interconnect of a released product is also a concern, as these bugs can threaten the viability of the entire system.
This dissertation provides solutions to enable the development of functionally correct interconnect designs. We first address the challenges encountered during design-time verification efforts, by providing two complementary mechanisms that allow emulation and post-silicon verification frameworks to capture a detailed overview of the functional behaviour of the interconnect. Our first solution re-purposes the contents of in-flight traffic to log debug data from the interconnect's execution. This approach enables the validation of the interconnect using synthetic traffic workloads, while attaining over 80% observability of the routes followed by packets and capturing valuable debugging information. We also develop an alternative mechanism that boosts observability by taking periodic snapshots of execution, thus extending the verification capabilities to run both synthetic traffic and real-application workloads. The collected snapshots enhance detection and debugging support, and they provide observability of over 50% of packets and reconstructs at least half of each of their routes. Moreover, we also develop error detection and recovery solutions to address the threat of design bugs escaping into the interconnect's runtime operation. Our runtime techniques can overcome communication errors without needing to store replicate copies of all in-flight packets, thereby achieving correctness at minimal area costsPhDComputer Science and EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/116741/1/rawanak_1.pd
Decompose and Conquer: Addressing Evasive Errors in Systems on Chip
Modern computer chips comprise many components, including microprocessor cores, memory modules, on-chip networks, and accelerators. Such system-on-chip (SoC) designs are deployed in a variety of computing devices: from internet-of-things, to smartphones, to personal computers, to data centers. In this dissertation, we discuss evasive errors in SoC designs and how these errors can be addressed efficiently. In particular, we focus on two types of errors: design bugs and permanent faults. Design bugs originate from the limited amount of time allowed for design verification and validation. Thus, they are often found in functional features that are rarely activated. Complete functional verification, which can eliminate design bugs, is extremely time-consuming, thus impractical in modern complex SoC designs. Permanent faults are caused by failures of fragile transistors in nano-scale semiconductor manufacturing processes. Indeed, weak transistors may wear out unexpectedly within the lifespan of the design. Hardware structures that reduce the occurrence of permanent faults incur significant silicon area or performance overheads, thus they are infeasible for most cost-sensitive SoC designs.
To tackle and overcome these evasive errors efficiently, we propose to leverage the principle of decomposition to lower the complexity of the software analysis or the hardware structures involved. To this end, we present several decomposition techniques, specific to major SoC components. We first focus on microprocessor cores, by presenting a lightweight bug-masking analysis that decomposes a program into individual instructions to identify if a design bug would be masked by the program's execution. We then move to memory subsystems: there, we offer an efficient memory consistency testing framework to detect buggy memory-ordering behaviors, which decomposes the memory-ordering graph into small components based on incremental differences. We also propose a microarchitectural patching solution for memory subsystem bugs, which augments each core node with a small distributed programmable logic, instead of including a global patching module. In the context of on-chip networks, we propose two routing reconfiguration algorithms that bypass faulty network resources. The first computes short-term routes in a distributed fashion, localized to the fault region. The second decomposes application-aware routing computation into simple routing rules so to quickly find deadlock-free, application-optimized routes in a fault-ridden network. Finally, we consider general accelerator modules in SoC designs. When a system includes many accelerators, there are a variety of interactions among them that must be verified to catch buggy interactions. To this end, we decompose such inter-module communication into basic interaction elements, which can be reassembled into new, interesting tests.
Overall, we show that the decomposition of complex software algorithms and hardware structures can significantly reduce overheads: up to three orders of magnitude in the bug-masking analysis and the application-aware routing, approximately 50 times in the routing reconfiguration latency, and 5 times on average in the memory-ordering graph checking. These overhead reductions come with losses in error coverage: 23% undetected bug-masking incidents, 39% non-patchable memory bugs, and occasionally we overlook rare patterns of multiple faults. In this dissertation, we discuss the ideas and their trade-offs, and present future research directions.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttps://deepblue.lib.umich.edu/bitstream/2027.42/147637/1/doowon_1.pd
Low-cost error detection through high-level synthesis
System-on-chip design is becoming increasingly complex as technology scaling enables more and more functionality on a chip. This scaling and complexity has resulted in a variety of reliability and validation challenges including logic bugs, hot spots, wear-out, and soft errors. To make matters worse, as we reach the limits of Dennard scaling, efforts to improve system performance and energy efficiency have resulted in the integration of a wide variety of complex hardware accelerators in SoCs. Thus the challenge is to design complex, custom hardware that is efficient, but also correct and reliable.
High-level synthesis shows promise to address the problem of complex hardware design by providing a bridge from the high-productivity software domain to the hardware design process. Much research has been done on high-level synthesis efficiency optimizations. This thesis shows that high-level synthesis also has the power to address validation and reliability challenges through two solutions.
One solution for circuit reliability is modulo-3 shadow datapaths: performing lightweight shadow computations in modulo-3 space for each main computation. We leverage the binding and scheduling flexibility of high-level synthesis to detect control errors through diverse binding and minimize area cost through intelligent checkpoint scheduling and modulo-3 reducer sharing. We introduce logic and dataflow optimizations to further reduce cost. We evaluated our technique with 12 high-level synthesis benchmarks from the arithmetic-oriented PolyBench benchmark suite using FPGA emulated netlist-level error injection. We observe coverages of 99.1% for stuck-at faults, 99.5% for soft errors, and 99.6% for timing errors with a 25.7% area cost and negligible performance impact. Leveraging a mean error detection latency of 12.75 cycles (4150x faster than end result check) for soft errors, we also explore a rollback recovery method with an additional area cost of 28.0%, observing a 175x increase in reliability against soft errors.
Another solution for rapid post-silicon validation of accelerator designs is Hybrid Quick Error Detection (H-QED): inserting signature generation logic in a hardware design to create a heavily compressed signature stream that captures the internal behavior of the design at a fine temporal and spatial granularity for comparison with a reference set of signatures generated by high-level simulation to detect bugs. Using H-QED, we demonstrate an improvement in error detection latency (time elapsed from when a bug is activated to when it manifests as an observable failure) of two orders of magnitude and a threefold improvement in bug coverage compared to traditional post-silicon validation techniques. H-QED also uncovered previously unknown bugs in the CHStone benchmark suite, which is widely used by the HLS community. H-QED incurs less than 10% area overhead for the accelerator it validates with negligible performance impact, and we also introduce techniques to minimize any possible intrusiveness introduced by H-QED
Evaluation of Open-Source EDA Tool “EDA Playground”
With the advancement of Information Technology, the design, verification, and manufacturing of Integrated circuits have been challenging and time consuming. Unlike the software domain, Electronic Design Automation (EDA) tools are mostly commercially available, and access is limited to the students. An open-source EDA tool might help the students to initialize the learning process. This thesis showcases an open-source EDA platform, EDA Playground, where users can practice their hardware description language (HDL) codes, create a testbench to simulate their designs and synthesize their code.
The thesis shows how EDA Playground provides its users with the ability to write code in various HDLs, enabling them to evaluate their designs using a range of both commercial and freely available simulators. Additionally, it also shows how the platform helps in identifying and resolving design failures through the utilization of waveform viewing tool, EPwave, developed my EDA Playground and logs. It is also highlighted how users have the ability to employ commercial synthesizers in order to combine their codes, thereby facilitating the assessment of device utilization and circuit diagram.
Another notable objective of the thesis is to highlight the application of EDA Playground to the incorporate of UVM 1.2. A step-by-step UVM testbench of a simple SystemVerilog adder was developed and simulated as a part of the thesis. Prospective users have the opportunity to gain knowledge about this methodology by accessing educational resources, which encompass various tools and examples provided for their advantage.
The thesis provides an extensive array of use cases that showcase the varied functionalities provided by EDA Playground. This thesis extensively employs and evaluates the diverse resources offered on EDA Playground to determine their usefulness
Development of electronics for the VELO upgrade detector
Esta tesis cubre el diseño electrónico del detector de vértices (VELO) del
experimento LHCb del CERN. El VELO está situado rodeando el punto de colisión de los dos haces de protones
del LHC del CERN. Su diseño está lleno de restricciones que requieren diseños novedosos: minimizar la materia
cerca del punto de colisión, diseño de componentes que soporten radiación, transmisión de datos a alta tasa y el
procesado de los mismos, sincronización del sistema, etc. El trabajo presentado en esta tesis se centra en: por un
lado, la validación del hardware y sus diferentes prototipos, por otro lado, el diseño del firmware de las FPGAs
encargadas del control, sincronización y adquisición de datos del VELO
- …