309 research outputs found
Innovative Techniques for Testing and Diagnosing SoCs
For our everyday welfare we rely on the continued functioning of many electronic devices,
which usually embed integrated circuits that keep becoming cheaper and smaller
with improved features. Nowadays, microelectronics can integrate a working computer
with CPU, memories, and even GPUs on a single die, known as a System-on-Chip (SoC).
SoCs are also employed in automotive safety-critical applications, but need to be tested
thoroughly to comply with reliability standards, in particular the ISO 26262 functional
safety standard for road vehicles.
The goal of this PhD thesis is to improve SoC reliability by proposing innovative
techniques for testing and diagnosing its internal modules: CPUs, memories, peripherals,
and GPUs. The proposed approaches, in the order in which they appear in this thesis,
are as follows:
1. Embedded Memory Diagnosis: Memories are dense and complex circuits which
are susceptible to design and manufacturing errors. Hence, it is important to understand
the fault occurrence in the memory array. In practice, the logical and physical
array representations differ because of design optimizations that add enhancements to
the device, known as scrambling. This part proposes an accurate memory diagnosis
based on a software tool able to analyze test results, unscramble
the memory array, map failing syndromes to cell locations, perform a cumulative
analysis, and formulate a final fault model hypothesis. Several SRAM failing
syndromes gathered from an industrial automotive 32-bit SoC developed by
STMicroelectronics were analyzed as case studies. The tool displayed defects virtually,
and the results were confirmed by photographs taken with a microscope.
2. Functional Test Pattern Generation: The key to a successful test is the patterns applied
to the device. They can be structural or functional; the former usually benefit
from embedded test modules targeting manufacturing errors and are only effective
before shipping the component to the client. The latter, on the other hand, can be
applied during mission with minimal impact on performance but are penalized by
high generation time. However, functional test patterns have the advantage that they
can target different goals in functional mission mode. Part III of this PhD thesis proposes
three different functional test pattern generation methods for CPU cores embedded
in SoCs, targeting different test purposes, described as follows:
a. Functional Stress Patterns: suitable for optimizing functional stress during
Operational-life Tests and Burn-in Screening for an optimal device reliability
characterization.
b. Functional Power Hungry Patterns: suitable for determining the functional
peak power used to strictly limit the power of structural patterns during manufacturing
tests, thus reducing premature device over-kill while delivering high test
coverage.
c. Software-Based Self-Test Patterns: combine the potential of structural patterns
with that of functional ones, allowing them to be executed periodically during mission.
In addition, external hardware communicating with a devised SBST was proposed.
It helps increase the fault coverage by 3% by testing critical Hardly
Functionally Testable Faults not covered by conventional SBST patterns.
An automatic functional test pattern generator exploiting an evolutionary algorithm
that maximizes metrics related to stress, power, and fault coverage was employed
in the above-mentioned approaches to quickly generate the desired patterns. The
approaches were evaluated on two industrial cases developed by STMicroelectronics:
an 8051-based SoC and a 32-bit Power Architecture SoC. Results show that generation
time was reduced by up to 75% in comparison to older methodologies while
significantly increasing the desired metrics.
3. Fault Injection in GPGPU: Fault injection mechanisms in semiconductor devices
are suitable for generating structural patterns, testing and activating mitigation techniques,
and validating robust hardware and software applications. GPGPUs are
known for fast parallel computation and are used in high-performance computing and
advanced driver assistance, where reliability is key. However, GPGPU manufacturers
do not release design descriptions for confidentiality reasons. Using commercial
fault injectors that rely on a GPGPU model is therefore infeasible, leaving costly
radiation tests as the only available resource. In the last part of this thesis, we
propose a software-implemented fault injector able to inject bit-flips in memory elements
of a real GPGPU. It exploits a software debugger tool combined with the
C-CUDA grammar to determine suitable fault locations and apply bit-flip operations to
program variables. The goal is to validate robust parallel algorithms by studying
fault propagation or activating the redundancy mechanisms they may embed. The
effectiveness of the tool was evaluated on two robust applications: redundant parallel
matrix multiplication and floating-point Fast Fourier Transform.
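The bit-flip idea in item 3 can be illustrated with a minimal host-side sketch. It is not the thesis tool, which works through a debugger on a real GPGPU running CUDA code. Here a single bit of an IEEE-754 single-precision value is inverted in plain Python, and a duplicate-and-compare matrix multiplication stands in for the redundant application. The function names `flip_bit_float32`, `matmul`, and `redundant_matmul`, as well as the `(row, col, bit)` fault descriptor, are assumptions made for illustration only.

```python
import struct


def flip_bit_float32(value: float, bit: int) -> float:
    """Return `value`, encoded as IEEE-754 single precision, with one bit inverted."""
    if not 0 <= bit < 32:
        raise ValueError("bit must be in [0, 31]")
    # Reinterpret the float as a 32-bit unsigned integer, flip, and convert back.
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


def matmul(a, b):
    """Plain triple-loop matrix product on lists of lists."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


def redundant_matmul(a, b, fault=None):
    """Compute the product twice and compare the two copies.

    `fault` is an optional (row, col, bit) tuple: the selected element of the
    first copy is corrupted before the comparison (hypothetical fault model).
    Returns (second_copy, mismatch_detected).
    """
    first = matmul(a, b)
    second = matmul(a, b)
    if fault is not None:
        row, col, bit = fault
        first[row][col] = flip_bit_float32(first[row][col], bit)
    # Numerical comparison: a sign flip on 0.0 would go unnoticed (0.0 == -0.0).
    return second, first != second
```

Calling `redundant_matmul(a, b, fault=(0, 1, 31))` corrupts the sign bit of one output element, which is the kind of event a redundancy mechanism is expected to catch; flipping a low-order mantissa bit instead shows how small the corruption can be while still being detectable by exact comparison.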
Characterising lithium-ion battery degradation through the identification and tracking of electrochemical battery model parameters
Lithium-ion (Li-ion) batteries undergo complex electrochemical and mechanical degradation. This complexity is pronounced in applications such as electric vehicles, where highly demanding cycles of operation and varying environmental conditions lead to non-trivial interactions of ageing stress factors. This work presents the framework for an ageing diagnostic tool based on identifying and then tracking the evolution of model parameters of a fundamental electrochemistry-based battery model from non-invasive voltage/current cycling tests. In addition to helping explain the underlying degradation mechanisms, the optimisation algorithm developed in this work allows for rapid parametrisation of the pseudo-two-dimensional (P2D) Doyle-Fuller-Newman battery model. This is achieved by exploiting the embedded symbolic manipulation capabilities and global optimisation methods within MapleSim. Results are presented that highlight the significant reductions in the computational resources required for solving systems of coupled non-linear partial differential equations.
The impact of soft errors in logic and its commercialisation in ARM IP
The significance of soft errors in logic has grown because of reduced memory
vulnerability and the shrinking dimensions of semiconductor technology coupled
with the increasing amount of logic integrated into a chip. Consequently, some
of ARM’s customers are concerned about how soft errors on the bus interconnect
will affect the dependability of their systems, since the interconnect is a critical
hub of communication in a SoC and represents a substantial and growing amount
of logic. With the rising complexity of their systems, the interconnect will
become larger and more complex in the future, adding to their concern. In this
work the impact of soft errors on the bus interconnect logic was investigated
and a product was developed to ameliorate the effects of such errors on ARM’s
customers’ products.
Methods to measure the SER of ARM IP were investigated by focusing on
logical masking, which is a component in the calculation of the SER. The effect
that the topology of a combinatorial logic circuit has on its logical masking rate
was considered by performing gate-level statistical fault injection on different
implementations of adder circuits. Significant variation in logical masking was
found ranging from a factor of 3.1 at a synthesis frequency of 100 MHz to a factor
of 2.1 at 900 MHz. This difference is explained in an original way by correlating
logical masking with the circuit’s path length and fan-out. These properties
could be used to create a static method of measuring the logical masking rather
than the current time-consuming method of dynamic simulation. Additionally,
nearly 30% of faults injected cause more than one error, which means that the
combinational SER will be underestimated if research does not take gate fan-out
into consideration. Using this methodology, a circuit designer can now base their
choice or development of a circuit on its reliability as well as its performance,
power, and area. Studying the variation in the factors that affect the SER is
important to ensure accuracy in addressing customer requirements.
Although it is important to consider the rate of soft error occurrence, in this
work the impact of errors is demonstrated to be critical. Using protocol-level
fault injection it is shown that faults on the ARM AXI bus interconnect can have
a serious effect on the reliability of the entire SoC such as deadlock, memory
corruption, or undefined behaviour. Using a fault-path traversal algorithm,
it is demonstrated that traditional error detection codes are not sufficient to
prevent these failures when faults occur on certain AXI bus signals. This led
to the development of novel fault tolerant methods that provide protection for
these identified signals. Based on these developments, a product was proposed as
an add-on to the AXI bus interconnect that can detect, correct, and report logic
soft errors without changing the AMBA standard or the customer’s connecting
IP.
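The gate-level statistical fault injection on adders described in this abstract can be sketched as follows. This is a toy model under stated assumptions, not the methodology applied to ARM IP: the adder is a hand-written ripple-carry structure with five gates per bit, a fault is modelled as inverting one gate output for one evaluation, and inputs and fault sites are drawn uniformly at random. The names `ripple_adder` and `masking_stats` are hypothetical.

```python
import random

GATES_PER_BIT = 5  # xor, xor, and, and, or in each full adder


def ripple_adder(a: int, b: int, width: int, flip_gate=None) -> int:
    """Evaluate a gate-level ripple-carry adder.

    `flip_gate` is the index of the gate whose output is inverted
    (a simple transient-fault model), or None for a fault-free run.
    """
    gate_index = 0

    def gate(value: int) -> int:
        nonlocal gate_index
        if gate_index == flip_gate:
            value ^= 1
        gate_index += 1
        return value

    carry = 0
    result = 0
    for i in range(width):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        half_sum = gate(ai ^ bi)
        sum_bit = gate(half_sum ^ carry)
        generate = gate(ai & bi)
        propagate = gate(half_sum & carry)
        carry = gate(generate | propagate)
        result |= sum_bit << i
    return result | (carry << width)


def masking_stats(width: int, trials: int, seed: int = 0):
    """Return (masked_fraction, multi_bit_error_fraction) over random injections."""
    rng = random.Random(seed)
    masked = multi_bit = 0
    for _ in range(trials):
        a = rng.getrandbits(width)
        b = rng.getrandbits(width)
        site = rng.randrange(GATES_PER_BIT * width)
        diff = ripple_adder(a, b, width) ^ ripple_adder(a, b, width, flip_gate=site)
        if diff == 0:
            masked += 1          # the fault was logically masked
        elif bin(diff).count("1") > 1:
            multi_bit += 1       # one fault corrupted several output bits
    return masked / trials, multi_bit / trials
```

Running `masking_stats` on different adder structures is the dynamic-simulation approach that the abstract calls time-consuming. The second returned fraction corresponds to the observation that a single fault can cause more than one output error when its effect fans out, here through the carry chain.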
Real-Time Trace Decoding and Monitoring for Safety and Security in Embedded Systems
Integrated circuits and systems can be found almost everywhere in today’s world. As their use increases, they need to be made safer and more performant to meet current demands in processing power. FPGA-integrated SoCs can provide the ideal trade-off between performance, adaptability, and energy usage. One of today’s vital challenges lies in updating existing fault tolerance techniques for these new systems while utilizing all available processing capabilities, such as multi-core and heterogeneous processing units. Control-flow monitoring is one of the primary mechanisms described for error detection at the software architectural level for the highest grade of hazard level classifications (e.g., ASIL D) described in the industry safety standard ISO-26262. Control-flow errors are also known to compose the majority of detected errors for ICs and embedded systems in safety-critical and risk-susceptible environments [5]. Software-based monitoring methods remain the most popular [6–8]. However, recent studies show that the overheads they impose make actual reliability gains negligible [9, 10]. This work proposes and demonstrates a new control-flow checking method implemented in FPGA for multi-core embedded systems called control-flow trace checker (CFTC). CFTC uses existing trace and debug subsystems of modern processors to rebuild their execution states. It can identify any errors in real time by comparing executed states to a set of permitted state transitions determined statically. This novel implementation weighs hardware resource trade-offs to target multiple independent tasks in multi-core embedded applications, as well as single-core systems. The proposed system is entirely implemented in hardware and isolated from all monitored software components, requiring 2.4% of the target FPGA platform resources to protect an execution unit in its entirety. Therefore, it avoids undesired overheads and maintains deterministic error detection latencies, which guarantees reliability improvements without impairing the target software system. Finally, CFTC is evaluated under different software fault-injection scenarios, achieving detection rates of 100% of all control-flow errors to wrong destinations and 98% of all injected faults to program binaries. All detection times are further analyzed and precisely described by a model based on the monitor’s resources and speed and the software application’s control-flow structure and binary characteristics.
Resumo: Integrated circuits are present in almost every complex system of the modern world. As their use grows, they need to become safer and more performant to meet new demands for processing power. Systems-on-Chip integrated with FPGAs can provide the ideal balance between performance, adaptability, and energy usage. One of the biggest challenges now is the need to update fault tolerance techniques for these new systems, taking advantage of the new advances in processing capacity. Control-flow monitoring is one of the main mechanisms for error detection at the software level for systems classified as high risk (e.g., ASIL D), described in safety standards such as ISO-26262. These errors are known to make up the majority of errors detected in integrated systems [5]. Although software-based monitoring methods remain the most popular [6–8], recent studies show that their additional costs, in terms of performance and area, considerably diminish their actual reliability gains [9, 10]. We propose here a new control-flow monitoring method implemented in FPGA for multi-core embedded systems. This method uses trace and code-execution subsystems to reconstruct the current state of the processor, identifying errors through comparisons between different execution states of the CPU. We propose an implementation that considers trade-offs in the use of system resources to monitor multiple independent tasks. Our approach supports the monitoring of simple systems as well as multi-core, multitasking systems. Finally, our technique is implemented entirely in hardware, avoiding the use of software processing units that could add undesirable costs to the application in the form of lost reliability. We thus propose a scalable and extensible control-flow checking mechanism for the protection of critical, multi-core embedded systems.
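The checking principle described in this abstract, comparing observed execution states against a set of permitted transitions determined statically, can be sketched in software. The actual CFTC is implemented entirely in FPGA hardware and decodes processor trace streams. The version below only illustrates the comparison step, and the block addresses in `PERMITTED` and the function name `check_trace` are hypothetical.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

# Hypothetical statically derived control-flow graph:
# basic-block address -> set of permitted successor addresses.
PERMITTED: Dict[int, Set[int]] = {
    0x100: {0x120, 0x140},
    0x120: {0x160},
    0x140: {0x160},
    0x160: {0x100},
}


def check_trace(
    trace: Iterable[int], permitted: Dict[int, Set[int]]
) -> Optional[Tuple[int, int, int]]:
    """Return (position, source, destination) of the first illegal transition.

    `trace` is the sequence of basic-block addresses reconstructed from the
    trace stream. Returns None when every transition is permitted.
    """
    iterator = iter(trace)
    try:
        previous = next(iterator)
    except StopIteration:
        return None  # an empty trace contains no transitions
    for position, current in enumerate(iterator, start=1):
        if current not in permitted.get(previous, set()):
            return position, previous, current
        previous = current
    return None
```

Because each observed transition needs only one set lookup, the work per trace element is bounded, which is in the spirit of the deterministic detection latency the abstract emphasizes; the hardware design itself is not reproduced here.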
Multi-core devices for safety-critical systems: a survey
Multi-core devices are envisioned to support the development of next-generation safety-critical systems, enabling the on-chip integration of functions of different criticality. This integration provides multiple system-level potential benefits such as cost, size, power, and weight reduction. However, safety certification becomes a challenge and several fundamental safety technical requirements must be addressed, such as temporal and spatial independence, reliability, and diagnostic coverage. This survey provides a categorization and overview at different device abstraction levels (nanoscale, component, and device) of selected key research contributions that support the compliance with these fundamental safety requirements. This work has been partially supported by the Spanish Ministry of Economy and Competitiveness under grant TIN2015-65316-P, Basque Government under grant KK-2019-00035 and the HiPEAC Network of Excellence. The Spanish Ministry of Economy and Competitiveness has also partially supported Jaume Abella under Ramon y Cajal postdoctoral fellowship (RYC-2013-14717). Peer Reviewed. Postprint (author's final draft)
Single Event Effects Assessment of UltraScale+ MPSoC Systems under Atmospheric Radiation
The AMD UltraScale+ XCZU9EG device is a Multi-Processor System-on-Chip
(MPSoC) with embedded Programmable Logic (PL) that excels in many Edge (e.g.,
automotive or avionics) and Cloud (e.g., data centres) terrestrial
applications. However, it incorporates a large amount of SRAM cells, making the
device vulnerable to Neutron-induced Single Event Upsets (NSEUs) or otherwise
soft errors. Semiconductor vendors incorporate soft error mitigation mechanisms
to recover memory upsets (i.e., faults) before they propagate to the
application output and become an error. But how effective are the MPSoC's
mitigation schemes? Can they effectively recover upsets in high altitude or
large scale applications under different workloads? This article answers the
above research questions through a solid study that entails accelerated neutron
radiation testing and dependability analysis. We test the device on a broad
range of workloads, like multi-threaded software used for pose estimation and
weather prediction or a software/hardware (SW/HW) co-design image
classification application running on the AMD Deep Learning Processing Unit
(DPU). Assuming a one-node MPSoC system in New York City (NYC) at 40k feet, all
tested software applications achieve a Mean Time To Failure (MTTF) greater than
148 months, which shows that upsets are effectively recovered in the processing
system of the MPSoC. However, the SW/HW co-design (i.e., DPU) in the same
one-node system at 40k feet has an MTTF = 4 months due to the high failure rate
of its PL accelerator, which emphasises that some MPSoC workloads may require
additional NSEU mitigation schemes. Nevertheless, we show that the MTTF of the
DPU can increase to 87 months without any overhead if one disregards the
failure rate of tolerable errors since they do not affect the correctness of
the classification output. Comment: This manuscript is under review at IEEE Transactions on Reliability.
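The MTTF figures quoted in this abstract can be related to one another with simple arithmetic, assuming a constant failure rate so that the failure rate is the reciprocal of the MTTF. That assumption, the series-system scaling to several nodes, and the function names below are illustrative and are not taken from the article.

```python
def failure_rate(mttf_months: float) -> float:
    """Failure rate in failures per month, assuming a constant rate (1 / MTTF)."""
    return 1.0 / mttf_months


def tolerable_fraction(mttf_all_errors: float, mttf_critical_only: float) -> float:
    """Share of failures that are tolerable, given the MTTF with and without them."""
    return 1.0 - failure_rate(mttf_critical_only) / failure_rate(mttf_all_errors)


def system_mttf(node_mttf_months: float, nodes: int) -> float:
    """MTTF of `nodes` identical, independent nodes where any node failure counts."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return node_mttf_months / nodes


# DPU workload at 40k feet: 4 months counting all errors, 87 months when
# tolerable errors are disregarded (figures from the abstract).
dpu_tolerable_share = tolerable_fraction(4.0, 87.0)
```

Under these assumptions `dpu_tolerable_share` evaluates to 1 - 4/87, roughly 0.95, meaning that about 95% of the DPU failures would be of the tolerable kind. `system_mttf` shows why large-scale deployments matter: the same per-node MTTF shrinks in proportion to the number of nodes.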
GPU devices for safety-critical systems: a survey
Graphics Processing Unit (GPU) devices and their associated software programming languages and frameworks can deliver the computing performance required to facilitate the development of next-generation high-performance safety-critical systems such as autonomous driving systems. However, the integration of complex, parallel, and computationally demanding software functions with different safety-criticality levels on GPU devices with shared hardware resources contributes to several safety certification challenges. This survey categorizes and provides an overview of research contributions that address GPU devices’ random hardware failures, systematic failures, and independence of execution. This work has been partially supported by the European Research Council with Horizon 2020 (grant agreements No. 772773 and 871465), the Spanish Ministry of Science and Innovation under grant PID2019-107255GB, the HiPEAC Network of Excellence and the Basque Government under grant KK-2019-00035. The Spanish Ministry of Economy and Competitiveness has also partially supported Leonidas Kosmidis with a Juan de la Cierva Incorporación postdoctoral fellowship (FJCI-2020-045931-I). Peer Reviewed. Postprint (author's final draft)